Hyperparameter Tuning with HyperOpt, Based on Bayesian Optimization
The drawback of grid search is that its runtime grows quickly as the number of hyperparameters and candidate values increases.
XGBoost and LightGBM deliver excellent performance, but they have so many hyperparameters that tuning them on large datasets takes a long time.
- Suppose we want to tune 6 LightGBM hyperparameters:
- max_depth: 5 candidate values, e.g. [10, 20, 30, 40, 50]
- num_leaves: 4 candidate values
- subsample: 5 candidate values
- min_child_weight: 4 candidate values
- colsample_bytree: 5 candidate values
- reg_alpha: 3 candidate values >> grid search would then have to run 5 × 4 × 5 × 4 × 5 × 3 = 6,000 iterations (see the quick count sketched below)
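A quick sanity check on that count; the candidate values other than max_depth are made up for illustration:
# count how many parameter combinations a grid search over these candidates would try
from itertools import product

param_grid = {
    'max_depth': [10, 20, 30, 40, 50],              # 5 candidates (from the text)
    'num_leaves': [32, 64, 128, 256],               # 4 candidates (illustrative)
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],         # 5 candidates (illustrative)
    'min_child_weight': [1, 2, 3, 4],               # 4 candidates (illustrative)
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],  # 5 candidates (illustrative)
    'reg_alpha': [0.01, 0.1, 1.0],                  # 3 candidates (illustrative)
}
print(len(list(product(*param_grid.values()))))     # 5*4*5*4*5*3 = 6000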
So for the large datasets seen in practice, Bayesian optimization is used instead.
Bayesian Optimization
A technique that efficiently finds, in as few trials as possible, the optimal inputs that produce the maximum or minimum return value of a black-box function whose exact form is unknown.
- Bayesian probability: improving the posterior probability as new events or sample data are observed
- Bayesian optimization: improving a posterior model that predicts the optimal function as new data points are observed
- Its two building blocks are the surrogate model and the acquisition function (a rough sketch of how they interact follows below)
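HyperOpt's TPE algorithm builds its surrogate differently, but the general surrogate-model / acquisition-function loop can be illustrated with a toy Gaussian-process sketch; this is purely illustrative, not HyperOpt's implementation, and black_box and all values here are made up:
# Toy sketch of the surrogate/acquisition loop (illustrative only, not HyperOpt's TPE):
# fit a surrogate to the points evaluated so far, pick the next point with an
# acquisition function (here a lower confidence bound), evaluate it, repeat.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def black_box(x):                                    # hypothetical expensive objective to minimize
    return (x - 2.0) ** 2 + np.sin(5 * x)

candidates = np.linspace(-5, 5, 200).reshape(-1, 1)
X_obs = np.array([[-4.0], [0.0], [4.0]])             # a few initial observations
y_obs = black_box(X_obs).ravel()

gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
for _ in range(10):
    gp.fit(X_obs, y_obs)                             # update the surrogate model
    mu, sigma = gp.predict(candidates, return_std=True)
    acquisition = mu - 1.96 * sigma                  # lower confidence bound
    x_next = candidates[np.argmin(acquisition)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])               # evaluate and record the new point
    y_obs = np.append(y_obs, black_box(x_next).ravel())

print('estimated minimizer:', X_obs[np.argmin(y_obs)][0], 'value:', y_obs.min())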
Using HyperOpt
<Main workflow of hyperopt>
- Define the search space of input variable names and value ranges >> as a dictionary, using the hp module
- Define the objective function >> it takes that dictionary as an argument and returns a specific value
- Infer the optimal inputs that give the objective function's minimum return value >> based on Bayesian optimization
In [5]:
from hyperopt import hp
# input variable x over -10 to 10 with step 1, and y over -15 to 15 with step 1
search_space = {'x':hp.quniform('x', -10, 10, 1), 'y':hp.quniform('y', -15, 15, 1)}
from hyperopt import STATUS_OK
# objective function: takes the dictionary of variable values from the search space and returns a value
def objective_func(search_space):
    x = search_space['x']
    y = search_space['y']
    retval = x**2 - 20*y
    return retval
from hyperopt import fmin, tpe, Trials
import numpy as np
# Trials object that stores the inputs and results of each evaluation
trial_val = Trials()
# find the optimal inputs that minimize the objective function, using 5 evaluations
best_01 = fmin(fn=objective_func, space=search_space, algo=tpe.suggest, max_evals=5, trials=trial_val, rstate=np.random.default_rng(seed=0))
print('best:', best_01)
100%|██████████| 5/5 [00:00<00:00, 749.12trial/s, best loss: -224.0]
best: {'x': -4.0, 'y': 12.0}
- The result is best: {'x': -4.0, 'y': 12.0}. Since the objective is x**2 - 20*y, the return value approaches its minimum as x gets closer to 0 and y gets closer to 15.
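As a quick brute-force check (not part of the original notebook), the true minimum of x**2 - 20*y over this quantized search space is -300, at x=0 and y=15:
# brute-force check of the true minimum over the quantized search space
import numpy as np
xs = np.arange(-10, 11)                      # x candidates: -10 ... 10, step 1
ys = np.arange(-15, 16)                      # y candidates: -15 ... 15, step 1
vals = xs[:, None] ** 2 - 20 * ys[None, :]   # objective on the full grid
i, j = np.unravel_index(np.argmin(vals), vals.shape)
print(xs[i], ys[j], vals[i, j])              # 0 15 -300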
This time, run it 20 times with max_evals=20.
In [6]:
trial_val = Trials()
best_02 = fmin(fn=objective_func, space=search_space, algo=tpe.suggest, max_evals=20, trials=trial_val, rstate=np.random.default_rng(seed=0))
print('best:', best_02)
100%|██████████| 20/20 [00:00<00:00, 891.99trial/s, best loss: -296.0]
best: {'x': 2.0, 'y': 15.0}
- The result is a bit closer to the true minimum.
Grid search over this space would have needed 21 candidate values for x times 31 for y, i.e. 651 iterations; HyperOpt did not land exactly on the true optimum x=0, but it got close in only 20 evaluations.
- The Trials object passed to fmin() keeps, as attributes, the input values and return value of every evaluation of the objective function.
- Its main attributes are results and vals.
results is a list whose individual elements are dictionaries
vals is a dictionary mapping each input variable name to the list of values tried
In [7]:
print(trial_val.results)
[{'loss': -64.0, 'status': 'ok'}, {'loss': -184.0, 'status': 'ok'}, {'loss': 56.0, 'status': 'ok'}, {'loss': -224.0, 'status': 'ok'}, {'loss': 61.0, 'status': 'ok'}, {'loss': -296.0, 'status': 'ok'}, {'loss': -40.0, 'status': 'ok'}, {'loss': 281.0, 'status': 'ok'}, {'loss': 64.0, 'status': 'ok'}, {'loss': 100.0, 'status': 'ok'}, {'loss': 60.0, 'status': 'ok'}, {'loss': -39.0, 'status': 'ok'}, {'loss': 1.0, 'status': 'ok'}, {'loss': -164.0, 'status': 'ok'}, {'loss': 21.0, 'status': 'ok'}, {'loss': -56.0, 'status': 'ok'}, {'loss': 284.0, 'status': 'ok'}, {'loss': 176.0, 'status': 'ok'}, {'loss': -171.0, 'status': 'ok'}, {'loss': 0.0, 'status': 'ok'}]
In [8]:
print(trial_val.vals)
{'x': [-6.0, -4.0, 4.0, -4.0, 9.0, 2.0, 10.0, -9.0, -8.0, -0.0, -0.0, 1.0, 9.0, 6.0, 9.0, 2.0, -2.0, -4.0, 7.0, -0.0], 'y': [5.0, 10.0, -2.0, 12.0, 1.0, 15.0, 7.0, -10.0, 0.0, -5.0, -3.0, 2.0, 4.0, 10.0, 3.0, 3.0, -14.0, -8.0, 11.0, -0.0]}
- These values are hard to read as-is >> put them into a DataFrame.
In [9]:
import pandas as pd
# extract the values under the 'loss' key from results into a list
losses = [loss_dict['loss'] for loss_dict in trial_val.results]
result_df = pd.DataFrame({'x':trial_val.vals['x'], 'y':trial_val.vals['y'], 'losses':losses})
result_df
Out[9]:
|    | x    | y     | losses |
|----|------|-------|--------|
| 0  | -6.0 | 5.0   | -64.0  |
| 1  | -4.0 | 10.0  | -184.0 |
| 2  | 4.0  | -2.0  | 56.0   |
| 3  | -4.0 | 12.0  | -224.0 |
| 4  | 9.0  | 1.0   | 61.0   |
| 5  | 2.0  | 15.0  | -296.0 |
| 6  | 10.0 | 7.0   | -40.0  |
| 7  | -9.0 | -10.0 | 281.0  |
| 8  | -8.0 | 0.0   | 64.0   |
| 9  | -0.0 | -5.0  | 100.0  |
| 10 | -0.0 | -3.0  | 60.0   |
| 11 | 1.0  | 2.0   | -39.0  |
| 12 | 9.0  | 4.0   | 1.0    |
| 13 | 6.0  | 10.0  | -164.0 |
| 14 | 9.0  | 3.0   | 21.0   |
| 15 | 2.0  | 3.0   | -56.0  |
| 16 | -2.0 | -14.0 | 284.0  |
| 17 | -4.0 | -8.0  | 176.0  |
| 18 | 7.0  | 11.0  | -171.0 |
| 19 | -0.0 | -0.0  | 0.0    |
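To spot the best trials at a glance, the DataFrame built above can simply be sorted by the loss column:
# trials with the smallest (best) objective values first
result_df.sort_values(by='losses').head(3)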
Optimizing XGBoost Hyperparameters with HyperOpt
- Caution! The objective function must be optimized toward its minimum return value, so metrics where higher is better (such as accuracy) have to be multiplied by -1.
- Some hyperparameters accept only integer values, but HyperOpt passes and returns everything as floats, so a type cast is required.
In [10]:
from lightgbm import LGBMClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
cancer_df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
cancer_df['target']=dataset.target
X_features = cancer_df.iloc[:, :-1]
y_label = cancer_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
In [11]:
from hyperopt import hp
# max_depth: 5 to 20 in steps of 1, min_child_weight: 1 to 2 in steps of 1
# colsample_bytree: between 0.5 and 1, learning_rate: between 0.01 and 0.2
xgb_search_space = {'max_depth': hp.quniform('max_depth', 5, 20, 1),
                    'min_child_weight': hp.quniform('min_child_weight', 1, 2, 1),
                    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
                    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2)}
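Note that hp.quniform produces quantized floats (e.g. 6.0, not 6) while hp.uniform produces continuous values. To see what the space actually generates, hyperopt's stochastic sampler can draw example configurations; a quick check, assuming the xgb_search_space defined above:
# draw a few random configurations from the search space to inspect the value types
from hyperopt.pyll import stochastic
for _ in range(3):
    print(stochastic.sample(xgb_search_space))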
- Build the objective function with the caveats above in mind. To keep the runtime down, set n_estimators=100.
- Use cross_val_score with 3 cross-validation folds so it returns the accuracy.
In [16]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from hyperopt import STATUS_OK
def objective_func(search_space):
    # max_depth and min_child_weight accept only integers, so cast the float values HyperOpt passes in
    xgb_clf = XGBClassifier(n_estimators=100, max_depth=int(search_space['max_depth']),
                            min_child_weight=int(search_space['min_child_weight']),
                            learning_rate=search_space['learning_rate'],
                            colsample_bytree=search_space['colsample_bytree'],
                            eval_metric='logloss')
    # 3-fold cross-validated accuracy; multiply by -1 because fmin() minimizes the return value
    accuracy = cross_val_score(xgb_clf, X_train, y_train, scoring='accuracy', cv=3)
    return {'loss': -1 * np.mean(accuracy), 'status': STATUS_OK}
In [18]:
from hyperopt import fmin, tpe, Trials
trial_val = Trials()
best = fmin(fn=objective_func, space=xgb_search_space, algo=tpe.suggest, max_evals=50, trials=trial_val, rstate=np.random.default_rng(seed=9))
print('best:', best)
100%|██████████| 50/50 [00:10<00:00, 4.93trial/s, best loss: -0.9670616939700244]
best: {'colsample_bytree': 0.9599446282177103, 'learning_rate': 0.15480405522751015, 'max_depth': 6.0, 'min_child_weight': 2.0}
- Before passing the optimal hyperparameters returned by fmin() to XGBClassifier, cast the integer-type parameters to int and round the float-type parameters to 5 decimal places.
In [21]:
print('colsample_bytree:{0}, learning_rate:{1}, max_depth:{2}, min_child_weight:{3}'.format(
    round(best['colsample_bytree'], 5), round(best['learning_rate'], 5),
    int(best['max_depth']), int(best['min_child_weight'])))
colsample_bytree:0.95994, learning_rate:0.1548, max_depth:6, min_child_weight:2
- Retrain an XGBClassifier with the tuned hyperparameters and check its performance.
- Apply early stopping and increase n_estimators to 400.
In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
def get_clf_eval(y_test, pred, pred_proba):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('confusion matrix')
    print(confusion)
    print('accuracy: {0:.4f}, precision: {1:.4f}, recall: {2:.4f}, F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
In [23]:
xgb_wrapper = XGBClassifier(n_estimators=400,
                            learning_rate=round(best['learning_rate'], 5),
                            max_depth=int(best['max_depth']),
                            min_child_weight=int(best['min_child_weight']),
                            colsample_bytree=round(best['colsample_bytree'], 5))
evals = [(X_tr, y_tr), (X_val, y_val)]
xgb_wrapper.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=True)
preds = xgb_wrapper.predict(X_test)
pred_proba = xgb_wrapper.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, preds, pred_proba)
[0] validation_0-logloss:0.56834 validation_1-logloss:0.60660
[1] validation_0-logloss:0.47552 validation_1-logloss:0.54538
[2] validation_0-logloss:0.40208 validation_1-logloss:0.48735
[3] validation_0-logloss:0.34468 validation_1-logloss:0.45698
[4] validation_0-logloss:0.29775 validation_1-logloss:0.41729
[5] validation_0-logloss:0.26004 validation_1-logloss:0.39167
[6] validation_0-logloss:0.22681 validation_1-logloss:0.36682
[7] validation_0-logloss:0.20096 validation_1-logloss:0.34593
[8] validation_0-logloss:0.17762 validation_1-logloss:0.33030
[9] validation_0-logloss:0.15762 validation_1-logloss:0.31918
[10] validation_0-logloss:0.14233 validation_1-logloss:0.30772
[11] validation_0-logloss:0.12769 validation_1-logloss:0.30104
[12] validation_0-logloss:0.11566 validation_1-logloss:0.29621
[13] validation_0-logloss:0.10479 validation_1-logloss:0.29157
[14] validation_0-logloss:0.09640 validation_1-logloss:0.28495
[15] validation_0-logloss:0.08707 validation_1-logloss:0.28055
[16] validation_0-logloss:0.08067 validation_1-logloss:0.27775
[17] validation_0-logloss:0.07468 validation_1-logloss:0.27470
[18] validation_0-logloss:0.06971 validation_1-logloss:0.27426
[19] validation_0-logloss:0.06376 validation_1-logloss:0.27298
[20] validation_0-logloss:0.05900 validation_1-logloss:0.27030
[21] validation_0-logloss:0.05483 validation_1-logloss:0.26467
[22] validation_0-logloss:0.05115 validation_1-logloss:0.26722
[23] validation_0-logloss:0.04855 validation_1-logloss:0.26117
[24] validation_0-logloss:0.04630 validation_1-logloss:0.26024
[25] validation_0-logloss:0.04365 validation_1-logloss:0.26456
[26] validation_0-logloss:0.04105 validation_1-logloss:0.26599
[27] validation_0-logloss:0.03936 validation_1-logloss:0.26629
[28] validation_0-logloss:0.03716 validation_1-logloss:0.27067
[29] validation_0-logloss:0.03521 validation_1-logloss:0.26713
[30] validation_0-logloss:0.03347 validation_1-logloss:0.26820
[31] validation_0-logloss:0.03219 validation_1-logloss:0.26912
[32] validation_0-logloss:0.03091 validation_1-logloss:0.26841
[33] validation_0-logloss:0.02956 validation_1-logloss:0.27270
[34] validation_0-logloss:0.02873 validation_1-logloss:0.27204
[35] validation_0-logloss:0.02796 validation_1-logloss:0.27389
[36] validation_0-logloss:0.02732 validation_1-logloss:0.27463
[37] validation_0-logloss:0.02668 validation_1-logloss:0.27186
[38] validation_0-logloss:0.02604 validation_1-logloss:0.27278
[39] validation_0-logloss:0.02552 validation_1-logloss:0.27527
[40] validation_0-logloss:0.02481 validation_1-logloss:0.27140
[41] validation_0-logloss:0.02426 validation_1-logloss:0.27243
[42] validation_0-logloss:0.02377 validation_1-logloss:0.27126
[43] validation_0-logloss:0.02352 validation_1-logloss:0.26914
[44] validation_0-logloss:0.02304 validation_1-logloss:0.27011
[45] validation_0-logloss:0.02281 validation_1-logloss:0.27312
[46] validation_0-logloss:0.02226 validation_1-logloss:0.27251
[47] validation_0-logloss:0.02182 validation_1-logloss:0.27348
[48] validation_0-logloss:0.02165 validation_1-logloss:0.27169
[49] validation_0-logloss:0.02147 validation_1-logloss:0.27391
[50] validation_0-logloss:0.02129 validation_1-logloss:0.27328
[51] validation_0-logloss:0.02086 validation_1-logloss:0.27040
[52] validation_0-logloss:0.02071 validation_1-logloss:0.26869
[53] validation_0-logloss:0.02055 validation_1-logloss:0.27083
[54] validation_0-logloss:0.02040 validation_1-logloss:0.27105
[55] validation_0-logloss:0.02026 validation_1-logloss:0.27354
[56] validation_0-logloss:0.02013 validation_1-logloss:0.27299
[57] validation_0-logloss:0.02000 validation_1-logloss:0.27293
[58] validation_0-logloss:0.01986 validation_1-logloss:0.27131
[59] validation_0-logloss:0.01972 validation_1-logloss:0.27341
[60] validation_0-logloss:0.01960 validation_1-logloss:0.27364
[61] validation_0-logloss:0.01948 validation_1-logloss:0.27206
[62] validation_0-logloss:0.01935 validation_1-logloss:0.27347
[63] validation_0-logloss:0.01923 validation_1-logloss:0.27544
[64] validation_0-logloss:0.01912 validation_1-logloss:0.27390
[65] validation_0-logloss:0.01900 validation_1-logloss:0.27140
[66] validation_0-logloss:0.01889 validation_1-logloss:0.27092
[67] validation_0-logloss:0.01878 validation_1-logloss:0.27285
[68] validation_0-logloss:0.01867 validation_1-logloss:0.27140
[69] validation_0-logloss:0.01857 validation_1-logloss:0.27161
[70] validation_0-logloss:0.01847 validation_1-logloss:0.27348
[71] validation_0-logloss:0.01837 validation_1-logloss:0.27204
[72] validation_0-logloss:0.01827 validation_1-logloss:0.27280
[73] validation_0-logloss:0.01817 validation_1-logloss:0.27014
[74] validation_0-logloss:0.01807 validation_1-logloss:0.27143
confusion matrix
[[34 3]
[ 2 75]]
accuracy: 0.9561, precision: 0.9615, recall: 0.9740, F1: 0.9677, AUC: 0.9895