XGBoost (eXtreme Gradient Boosting)
- Trains considerably faster than classic GBM
- Shows excellent predictive performance for both classification and regression
- Has built-in regularization against overfitting, which makes it robust
- Tree pruning: splits that no longer yield a positive gain are pruned away, reducing the number of splits
- Can run cross-validation internally at each boosting iteration, so the number of boosting rounds can be optimized
- Handles missing values natively
The XGBoost package provides XGBClassifier and XGBRegressor as scikit-learn wrapper classes.
! Python wrapper XGBoost module: the original, standalone Python API of XGBoost
! scikit-learn wrapper XGBoost module: the module that integrates with scikit-learn
Python wrapper XGBoost hyperparameters
(Note that parameter names differ from the scikit-learn wrapper because of its naming conventions -- see the short mapping sketch below.)
- The hyperparameters are similar to GBM's, with parameters for early stopping and overfitting regularization added on top.
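For reference, a minimal sketch of how the same configuration is spelled in the two APIs (only a few of the renamed parameters are shown; the values are placeholders, not recommendations):

# native (Python wrapper) API: parameters go into a dict passed to xgb.train()
params = {'eta': 0.1, 'max_depth': 3, 'lambda': 1, 'alpha': 0,
          'objective': 'binary:logistic', 'eval_metric': 'logloss'}
# bst = xgb.train(params, dtrain, num_boost_round=400)

# scikit-learn wrapper: eta -> learning_rate, num_boost_round -> n_estimators,
# lambda -> reg_lambda, alpha -> reg_alpha
# from xgboost import XGBClassifier
# clf = XGBClassifier(learning_rate=0.1, max_depth=3, reg_lambda=1, reg_alpha=0,
#                     n_estimators=400, objective='binary:logistic', eval_metric='logloss')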
1) Main general parameters: the defaults are rarely changed
- booster: default = gbtree
- silent: default = 0; set to 1 to suppress log messages (deprecated in favor of verbosity in recent XGBoost versions)
- nthread: number of CPU threads to use; by default all available threads are used
2) Main booster parameters: parameters for tree optimization, boosting, and regularization
- eta: default = 0.3; same role as GBM's learning_rate; a value between 0 and 1
- (in the scikit-learn wrapper the default is 0.1, and values around 0.01~0.2 are usually preferred)
- num_boost_round: same role as GBM's n_estimators
- min_child_weight: default = 1; the minimum sum of instance weights required in a child node for a further split; larger values make splitting more conservative and help control overfitting
- gamma: default = 0; the minimum loss reduction required to make an additional split on a leaf node; a leaf is split only when the split reduces the loss by more than this value; larger values reduce overfitting
- max_depth: default = 6; 0 means no depth limit; larger values let trees specialize on particular feature conditions, raising the risk of overfitting; usually 3~10
- subsample: default = 1; same as GBM's subsample; the fraction of the training data sampled for each tree, used to control overfitting (0.5 means half of the data is used); usually 0.5~1
- colsample_bytree: default = 1; similar to GBM's max_features; the fraction of feature columns randomly sampled when building each tree; used to control overfitting
- lambda: default = 1; L2 regularization weight; worth tuning when there are many features; larger values reduce overfitting
- alpha: default = 0; L1 regularization weight; worth tuning when there are many features; larger values reduce overfitting
- scale_pos_weight: default = 1; balances datasets whose classes are heavily skewed toward one label
3) Learning task parameters
- objective: defines the loss function to be minimized
- binary:logistic: for binary classification
- multi:softmax: for multiclass classification
- multi:softprob: like multi:softmax, but returns the predicted probability for each class label
- eval_metric: the metric used on the validation data; defaults to rmse for regression and error for classification
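For the multiclass objectives, num_class must also be set. A minimal sketch of task-parameter settings (the values are illustrative):

# binary classification
params_bin   = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}
# multiclass classification: num_class is required for multi:softmax / multi:softprob
params_multi = {'objective': 'multi:softprob', 'num_class': 3, 'eval_metric': 'mlogloss'}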
The better the algorithm, the less hyperparameter tuning it tends to need.
If overfitting is a serious problem, try the following (an illustrative adjustment follows the list):
- lower eta while raising num_boost_round
- lower max_depth
- raise min_child_weight
- raise gamma
- adjust subsample and colsample_bytree
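As a minimal sketch (the values are placeholders, not recommendations -- the right settings depend on the data), an overfitting-oriented params adjustment could look like:

params = {
    'eta': 0.02,               # lower learning rate
    'max_depth': 4,            # shallower trees
    'min_child_weight': 3,     # require more instance weight per child node
    'gamma': 1,                # demand a larger loss reduction per split
    'subsample': 0.8,          # row sampling
    'colsample_bytree': 0.8,   # column sampling
    'objective': 'binary:logistic',
    'eval_metric': 'logloss'
}
num_rounds = 1000              # compensate for the smaller eta with more boosting rounds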
Several performance-enhancing features that plain GBM lacks:
- Early stopping: even before reaching the number of boosting rounds set by n_estimators (num_boost_round in the native API), training stops once the evaluation error no longer improves, which cuts training time.
In [1]:
import xgboost
print(xgboost.__version__)
1.7.5
Predicting Wisconsin breast cancer with XGBoost
- The package itself provides cross-validation, performance evaluation, and feature importance visualization (plot_importance).
In [4]:
import xgboost as xgb
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
dataset = load_breast_cancer()
features = dataset.data
labels = dataset.target
cancer_df = pd.DataFrame(features, columns=dataset.feature_names)
cancer_df['target'] = labels
cancer_df.head(3)
Out[4]:
  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 rows × 31 columns
- 0: malignant / 1: benign
In [5]:
print(dataset.target_names)
print(cancer_df['target'].value_counts())
['malignant' 'benign']
1 357
0 212
Name: target, dtype: int64
- For XGBoost's early stopping and validation-set evaluation, split the 80% training portion again: 90% for the final training set and 10% for validation.
- For the feature data, slice cancer_df from the first column up to (but not including) the last column.
In [6]:
X_features = cancer_df.iloc[:, :-1]
y_label = cancer_df.iloc[:, -1]
# split into train / test sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y_label, test_size=0.2, random_state=156)
# split the training data again to create a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
print(X_train.shape, X_test.shape)
print(X_tr.shape,X_val.shape)
(455, 30) (114, 30)
(409, 30) (46, 30)
- The main difference between the Python wrapper and the scikit-learn wrapper is that the former uses DMatrix, XGBoost's dedicated data object.
In [7]:
# convert the DataFrame-based train/validation/test sets into DMatrix objects
dtr = xgb.DMatrix(data=X_tr, label=y_tr)
dval = xgb.DMatrix(data=X_val, label=y_val)
dtest = xgb.DMatrix(data=X_test, label=y_test)
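DMatrix is not limited to DataFrames; it also accepts NumPy arrays (and sparse matrices). A minimal sketch using the arrays from above:

X_np = np.asarray(X_tr)                                   # a plain ndarray also works
dtr_from_np = xgb.DMatrix(data=X_np, label=np.asarray(y_tr))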
- In the native API, XGBoost hyperparameters are usually passed as a dictionary.
In [8]:
params = {'max_depth': 3, 'eta':0.05, 'objective':'binary:logistic', 'eval_metric':'logloss'}
num_rounds=400
- The Python wrapper passes the hyperparameters to the module-level train() function.
- Early stopping is enabled by passing early_stopping_rounds to xgb.train() >> the evaluation dataset (evals) and eval_metric must be set together for it to work.
In [9]:
# label the training set 'train' and the validation set 'eval'
eval_list = [(dtr, 'train'), (dval, 'eval')] # eval_list = [(dval, 'eval')] alone would also work
xgb_model = xgb.train(params = params, dtrain=dtr, num_boost_round=num_rounds, early_stopping_rounds=50, evals=eval_list)
[0] train-logloss:0.65016 eval-logloss:0.66183
[1] train-logloss:0.61131 eval-logloss:0.63609
[2] train-logloss:0.57563 eval-logloss:0.61144
[3] train-logloss:0.54310 eval-logloss:0.59204
[4] train-logloss:0.51323 eval-logloss:0.57329
[5] train-logloss:0.48447 eval-logloss:0.55037
[6] train-logloss:0.45796 eval-logloss:0.52930
[7] train-logloss:0.43436 eval-logloss:0.51534
[8] train-logloss:0.41150 eval-logloss:0.49718
[9] train-logloss:0.39027 eval-logloss:0.48154
[10] train-logloss:0.37128 eval-logloss:0.46990
[11] train-logloss:0.35254 eval-logloss:0.45474
[12] train-logloss:0.33528 eval-logloss:0.44229
[13] train-logloss:0.31892 eval-logloss:0.42961
[14] train-logloss:0.30439 eval-logloss:0.42065
[15] train-logloss:0.29000 eval-logloss:0.40958
[16] train-logloss:0.27651 eval-logloss:0.39887
[17] train-logloss:0.26389 eval-logloss:0.39050
[18] train-logloss:0.25210 eval-logloss:0.38254
[19] train-logloss:0.24123 eval-logloss:0.37393
[20] train-logloss:0.23076 eval-logloss:0.36789
[21] train-logloss:0.22091 eval-logloss:0.36017
[22] train-logloss:0.21155 eval-logloss:0.35421
[23] train-logloss:0.20263 eval-logloss:0.34683
[24] train-logloss:0.19434 eval-logloss:0.34111
[25] train-logloss:0.18637 eval-logloss:0.33634
[26] train-logloss:0.17875 eval-logloss:0.33082
[27] train-logloss:0.17167 eval-logloss:0.32675
[28] train-logloss:0.16481 eval-logloss:0.32099
[29] train-logloss:0.15835 eval-logloss:0.31671
[30] train-logloss:0.15225 eval-logloss:0.31277
[31] train-logloss:0.14650 eval-logloss:0.30882
[32] train-logloss:0.14102 eval-logloss:0.30437
[33] train-logloss:0.13590 eval-logloss:0.30103
[34] train-logloss:0.13109 eval-logloss:0.29794
[35] train-logloss:0.12647 eval-logloss:0.29499
[36] train-logloss:0.12197 eval-logloss:0.29295
[37] train-logloss:0.11784 eval-logloss:0.29043
[38] train-logloss:0.11379 eval-logloss:0.28927
[39] train-logloss:0.10994 eval-logloss:0.28578
[40] train-logloss:0.10638 eval-logloss:0.28364
[41] train-logloss:0.10302 eval-logloss:0.28183
[42] train-logloss:0.09963 eval-logloss:0.28005
[43] train-logloss:0.09649 eval-logloss:0.27972
[44] train-logloss:0.09359 eval-logloss:0.27744
[45] train-logloss:0.09080 eval-logloss:0.27542
[46] train-logloss:0.08807 eval-logloss:0.27504
[47] train-logloss:0.08541 eval-logloss:0.27458
[48] train-logloss:0.08299 eval-logloss:0.27348
[49] train-logloss:0.08035 eval-logloss:0.27247
[50] train-logloss:0.07786 eval-logloss:0.27163
[51] train-logloss:0.07550 eval-logloss:0.27094
[52] train-logloss:0.07344 eval-logloss:0.26967
[53] train-logloss:0.07147 eval-logloss:0.27008
[54] train-logloss:0.06964 eval-logloss:0.26890
[55] train-logloss:0.06766 eval-logloss:0.26854
[56] train-logloss:0.06591 eval-logloss:0.26900
[57] train-logloss:0.06433 eval-logloss:0.26790
[58] train-logloss:0.06259 eval-logloss:0.26663
[59] train-logloss:0.06107 eval-logloss:0.26743
[60] train-logloss:0.05957 eval-logloss:0.26610
[61] train-logloss:0.05817 eval-logloss:0.26644
[62] train-logloss:0.05691 eval-logloss:0.26673
[63] train-logloss:0.05550 eval-logloss:0.26550
[64] train-logloss:0.05422 eval-logloss:0.26443
[65] train-logloss:0.05311 eval-logloss:0.26500
[66] train-logloss:0.05207 eval-logloss:0.26591
[67] train-logloss:0.05093 eval-logloss:0.26501
[68] train-logloss:0.04976 eval-logloss:0.26435
[69] train-logloss:0.04872 eval-logloss:0.26360
[70] train-logloss:0.04776 eval-logloss:0.26319
[71] train-logloss:0.04680 eval-logloss:0.26255
[72] train-logloss:0.04580 eval-logloss:0.26204
[73] train-logloss:0.04484 eval-logloss:0.26254
[74] train-logloss:0.04388 eval-logloss:0.26289
[75] train-logloss:0.04309 eval-logloss:0.26249
[76] train-logloss:0.04224 eval-logloss:0.26217
[77] train-logloss:0.04133 eval-logloss:0.26166
[78] train-logloss:0.04050 eval-logloss:0.26179
[79] train-logloss:0.03967 eval-logloss:0.26103
[80] train-logloss:0.03876 eval-logloss:0.26094
[81] train-logloss:0.03806 eval-logloss:0.26148
[82] train-logloss:0.03740 eval-logloss:0.26054
[83] train-logloss:0.03676 eval-logloss:0.25967
[84] train-logloss:0.03605 eval-logloss:0.25905
[85] train-logloss:0.03545 eval-logloss:0.26007
[86] train-logloss:0.03489 eval-logloss:0.25984
[87] train-logloss:0.03425 eval-logloss:0.25933
[88] train-logloss:0.03361 eval-logloss:0.25932
[89] train-logloss:0.03311 eval-logloss:0.26002
[90] train-logloss:0.03260 eval-logloss:0.25936
[91] train-logloss:0.03202 eval-logloss:0.25886
[92] train-logloss:0.03152 eval-logloss:0.25918
[93] train-logloss:0.03107 eval-logloss:0.25864
[94] train-logloss:0.03049 eval-logloss:0.25951
[95] train-logloss:0.03007 eval-logloss:0.26091
[96] train-logloss:0.02963 eval-logloss:0.26014
[97] train-logloss:0.02913 eval-logloss:0.25974
[98] train-logloss:0.02866 eval-logloss:0.25937
[99] train-logloss:0.02829 eval-logloss:0.25893
[100] train-logloss:0.02789 eval-logloss:0.25928
[101] train-logloss:0.02751 eval-logloss:0.25955
[102] train-logloss:0.02714 eval-logloss:0.25901
[103] train-logloss:0.02668 eval-logloss:0.25991
[104] train-logloss:0.02634 eval-logloss:0.25950
[105] train-logloss:0.02594 eval-logloss:0.25924
[106] train-logloss:0.02556 eval-logloss:0.25901
[107] train-logloss:0.02522 eval-logloss:0.25738
[108] train-logloss:0.02492 eval-logloss:0.25702
[109] train-logloss:0.02453 eval-logloss:0.25789
[110] train-logloss:0.02418 eval-logloss:0.25770
[111] train-logloss:0.02384 eval-logloss:0.25842
[112] train-logloss:0.02356 eval-logloss:0.25810
[113] train-logloss:0.02322 eval-logloss:0.25848
[114] train-logloss:0.02290 eval-logloss:0.25833
[115] train-logloss:0.02260 eval-logloss:0.25820
[116] train-logloss:0.02229 eval-logloss:0.25905
[117] train-logloss:0.02204 eval-logloss:0.25878
[118] train-logloss:0.02176 eval-logloss:0.25728
[119] train-logloss:0.02149 eval-logloss:0.25722
[120] train-logloss:0.02119 eval-logloss:0.25764
[121] train-logloss:0.02095 eval-logloss:0.25761
[122] train-logloss:0.02067 eval-logloss:0.25832
[123] train-logloss:0.02045 eval-logloss:0.25808
[124] train-logloss:0.02023 eval-logloss:0.25855
[125] train-logloss:0.01998 eval-logloss:0.25714
[126] train-logloss:0.01973 eval-logloss:0.25587
[127] train-logloss:0.01946 eval-logloss:0.25640
[128] train-logloss:0.01927 eval-logloss:0.25685
[129] train-logloss:0.01908 eval-logloss:0.25665
[130] train-logloss:0.01886 eval-logloss:0.25712
[131] train-logloss:0.01863 eval-logloss:0.25609
[132] train-logloss:0.01839 eval-logloss:0.25649
[133] train-logloss:0.01816 eval-logloss:0.25789
[134] train-logloss:0.01802 eval-logloss:0.25811
[135] train-logloss:0.01785 eval-logloss:0.25794
[136] train-logloss:0.01763 eval-logloss:0.25876
[137] train-logloss:0.01748 eval-logloss:0.25884
[138] train-logloss:0.01732 eval-logloss:0.25867
[139] train-logloss:0.01719 eval-logloss:0.25876
[140] train-logloss:0.01696 eval-logloss:0.25987
[141] train-logloss:0.01681 eval-logloss:0.25960
[142] train-logloss:0.01669 eval-logloss:0.25982
[143] train-logloss:0.01656 eval-logloss:0.25992
[144] train-logloss:0.01638 eval-logloss:0.26035
[145] train-logloss:0.01623 eval-logloss:0.26055
[146] train-logloss:0.01606 eval-logloss:0.26092
[147] train-logloss:0.01589 eval-logloss:0.26137
[148] train-logloss:0.01572 eval-logloss:0.25999
[149] train-logloss:0.01556 eval-logloss:0.26028
[150] train-logloss:0.01546 eval-logloss:0.26048
[151] train-logloss:0.01531 eval-logloss:0.26142
[152] train-logloss:0.01515 eval-logloss:0.26188
[153] train-logloss:0.01501 eval-logloss:0.26227
[154] train-logloss:0.01486 eval-logloss:0.26287
[155] train-logloss:0.01476 eval-logloss:0.26299
[156] train-logloss:0.01462 eval-logloss:0.26346
[157] train-logloss:0.01448 eval-logloss:0.26379
[158] train-logloss:0.01434 eval-logloss:0.26306
[159] train-logloss:0.01424 eval-logloss:0.26237
[160] train-logloss:0.01410 eval-logloss:0.26251
[161] train-logloss:0.01401 eval-logloss:0.26265
[162] train-logloss:0.01392 eval-logloss:0.26264
[163] train-logloss:0.01380 eval-logloss:0.26250
[164] train-logloss:0.01372 eval-logloss:0.26264
[165] train-logloss:0.01359 eval-logloss:0.26255
[166] train-logloss:0.01350 eval-logloss:0.26188
[167] train-logloss:0.01342 eval-logloss:0.26203
[168] train-logloss:0.01331 eval-logloss:0.26190
[169] train-logloss:0.01319 eval-logloss:0.26184
[170] train-logloss:0.01312 eval-logloss:0.26133
[171] train-logloss:0.01304 eval-logloss:0.26148
[172] train-logloss:0.01297 eval-logloss:0.26157
[173] train-logloss:0.01285 eval-logloss:0.26253
[174] train-logloss:0.01278 eval-logloss:0.26229
[175] train-logloss:0.01267 eval-logloss:0.26086
- As training proceeds, both train-logloss and eval-logloss decrease.
- Although num_boost_round was set to 400, training stopped early after round 175.
- eval-logloss was lowest (0.25587) at round 126 >> it then failed to improve for the following 50 rounds, so training was stopped.
- Now let's predict. Note that the Python wrapper's predict() does not return class labels but probabilities from which the result can be derived.
- Since this is a binary (malignant vs. benign) problem, just add logic that maps probabilities above 0.5 to 1 and the rest to 0.
In [10]:
pred_probs = xgb_model.predict(dtest)
print('First 10 outputs of predict(): predicted probabilities')
print(np.round(pred_probs[:10], 3))
preds = [1 if x > 0.5 else 0 for x in pred_probs]
print('First 10 predicted labels:', preds[:10])
First 10 outputs of predict(): predicted probabilities
[0.845 0.008 0.68 0.081 0.975 0.999 0.998 0.998 0.996 0.001]
First 10 predicted labels: [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
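Because early stopping was triggered, the returned booster also records the best round. The native predict() uses all trained trees by default; if you want predictions from only the trees up to the best round, a minimal sketch:

print(xgb_model.best_iteration, xgb_model.best_score)    # best round and its eval metric
pred_probs_best = xgb_model.predict(dtest, iteration_range=(0, xgb_model.best_iteration + 1))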
In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
def get_clf_eval(y_test, pred, pred_proba):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
In [12]:
get_clf_eval(y_test, preds, pred_probs)
Confusion matrix
[[34 3]
[ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9937
- Try the visualization utility built into the package.
In [14]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,12))
plot_importance(xgb_model, ax=ax)
Out[14]:
<Axes: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
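plot_importance ranks features by 'weight' (the F score: how many times a feature is used in splits) by default. If you want a different notion of importance, the booster exposes it through get_score -- a minimal sketch:

plot_importance(xgb_model, ax=ax, importance_type='gain')   # plot by average gain instead of split count
print(xgb_model.get_score(importance_type='gain'))          # importance values as a feature -> score dict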
The Python wrapper's cv()
- Similar to how cross-validation (e.g. GridSearchCV) is used in scikit-learn, xgb.cv() runs cross-validation on the dataset and can be used to find the optimal number of boosting rounds (a usage sketch follows the list).
- params: booster parameter dictionary
- dtrain: training data as a DMatrix
- num_boost_round: number of boosting rounds
- nfold: number of CV folds
- stratified: whether to use stratified sampling when building the folds
- metrics: evaluation metric(s) to track during CV
- early_stopping_rounds: stop early when the CV metric stops improving
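A minimal sketch of xgb.cv() on the DMatrix built above (the argument values are illustrative):

cv_results = xgb.cv(params=params, dtrain=dtr, num_boost_round=400,
                    nfold=3, stratified=True, metrics='logloss',
                    early_stopping_rounds=50, seed=156)
# cv_results is a DataFrame with the per-round mean/std of train and test logloss
print(cv_results[['train-logloss-mean', 'test-logloss-mean']].tail())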