[sklearn] (18) Decision Tree Overfitting - make_classification(), visualize_boundary()

얆생 2023. 5. 24. 13:12

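Decision Tree Overfitting

Let's visualize how a decision tree partitions the training data to make predictions, and the overfitting problem that this partitioning can cause.

Create an arbitrary dataset with scikit-learn's make_classification() function.
make_classification returns two objects: the feature dataset and the class label dataset.

Let's create a dataset with 2 features and 3 classes.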

In [ ]:

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

plt.title('3 Class values with 2 Features Sample data creation')

# Generate sample data with 2 features (for a 2D plot) and 3 classes
X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2, n_classes=3, n_clusters_per_class=1, random_state=0)

# Plot the 2 features on 2D axes, with each class shown in a different color
plt.scatter(X_features[:,0], X_features[:, 1], marker='o', c=y_labels, s=25, edgecolor='k')

Out[ ]:

<matplotlib.collections.PathCollection at 0x7f849cf3ae90>
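As a quick check on what make_classification() returned, and on why n_redundant=0 and n_clusters_per_class=1 are set explicitly, here is a small sketch. The helper raises_value_error() is a name made up for this illustration; it is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification

X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                           n_classes=3, n_clusters_per_class=1, random_state=0)

print(X_features.shape)       # n_samples defaults to 100, so 100 rows x 2 features
print(y_labels.shape)         # one class label per row
print(np.unique(y_labels))    # the distinct class labels


# Hypothetical helper for this example: True if the call is rejected with ValueError
def raises_value_error(**kwargs):
    try:
        make_classification(**kwargs)
    except ValueError:
        return True
    return False


# Default n_redundant=2: informative + redundant features would exceed n_features=2
redundant_default_fails = raises_value_error(n_features=2, n_informative=2, random_state=0)

# Default n_clusters_per_class=2: 3 classes * 2 clusters = 6 > 2**n_informative = 4
clusters_default_fails = raises_value_error(n_features=2, n_redundant=0, n_informative=2,
                                            n_classes=3, random_state=0)

print(redundant_default_fails, clusters_default_fails)
```

With only 2 features, leaving either of those two parameters at its default makes the call fail, which is why the cell above overrides both.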

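First, leave the hyperparameters at their defaults so that nothing constrains how the tree grows, and look at how the model splits the data.

To do this, define a separate visualize_boundary() function, which shows the model's decision criteria for each class as colored regions and boundaries.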

In [ ]:

import numpy as np

# Visualize a classifier's decision boundary
def visualize_boundary(model, X, y):
    fig,ax = plt.subplots()
    
    # Draw the training data as a scatter plot
    ax.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='rainbow', edgecolor='k',
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim_start , xlim_end = ax.get_xlim()
    ylim_start , ylim_end = ax.get_ylim()
    
    # Fit the model on the training data passed in as arguments
    model.fit(X, y)
    # Predict the class at every coordinate of a 200 x 200 meshgrid covering the plot area
    xx, yy = np.meshgrid(np.linspace(xlim_start,xlim_end, num=200),np.linspace(ylim_start,ylim_end, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
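    # Draw the class regions with contourf(). The level edges sit at -0.5, 0.5, 1.5, ...
    # so each integer class label gets its own color band.
    # clim is not a contourf() argument, so it is not passed here.
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap='rainbow', zorder=1)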
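The meshgrid / ravel / reshape steps inside visualize_boundary() are easier to follow on a tiny grid. This is only a sketch: fake_pred stands in for model.predict() and is not a real model.

```python
import numpy as np

# A tiny grid (3 x-values, 2 y-values) so the arrays are easy to read
xx, yy = np.meshgrid(np.linspace(0, 1, num=3), np.linspace(0, 1, num=2))

# Flatten both coordinate arrays and pair them up: one (x, y) row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for model.predict(): label 1 where x > 0.5, otherwise label 0
fake_pred = (grid_points[:, 0] > 0.5).astype(int)

# Reshape the flat predictions back to the grid layout that contourf() expects
Z = fake_pred.reshape(xx.shape)

print(xx.shape, grid_points.shape, Z.shape)
print(Z)
```

predict() wants one row per sample, while contourf() wants Z in the same 2D shape as xx and yy, which is why the function flattens before predicting and reshapes afterwards.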

In [ ]:

from sklearn.tree import DecisionTreeClassifier

# Fit an unconstrained decision tree and visualize its decision boundary
dt_clf = DecisionTreeClassifier(random_state=156).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

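Because the tree keeps splitting in order to classify even a few outlier points, it ends up with a very large number of decision boundaries.
With the default hyperparameters, the tree keeps growing until every leaf node is either pure (all samples in one class) or holds a single sample. This strict criterion is what makes the model so complex.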
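The complexity can also be read off the fitted tree itself. A minimal sketch, assuming the same dataset and random_state as above; it uses get_depth(), get_n_leaves(), and the fitted tree_ attribute:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Default hyperparameters: max_depth=None, min_samples_split=2, min_samples_leaf=1
dt_clf = DecisionTreeClassifier(random_state=156).fit(X_features, y_labels)

# Leaf nodes are the nodes without a left child (children_left == -1)
is_leaf = dt_clf.tree_.children_left == -1
smallest_leaf = int(dt_clf.tree_.n_node_samples[is_leaf].min())

train_accuracy = dt_clf.score(X_features, y_labels)

print('depth:', dt_clf.get_depth())
print('leaves:', dt_clf.get_n_leaves())
print('smallest leaf size:', smallest_leaf)
print('training accuracy:', train_accuracy)
```

A perfect score on the training data with very small leaves is what you would expect when the tree has memorized individual points, outliers included.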

Second, constrain tree growth by setting min_samples_leaf=6.

In [ ]:

dt_clf = DecisionTreeClassifier(min_samples_leaf=6, random_state=156).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

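The tree no longer reacts to the outliers, and the data is classified by more general rules.
The second model is likely to predict better on new data, because classification criteria that are overly tuned to the training data tend to lower accuracy on data the model has not seen.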
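One way to check this claim rather than take it on faith is to score both settings on held-out folds. A rough sketch using 5-fold cross_val_score on the same 100 samples; cv=5 is a choice made for this example, and with so little data the gap between the two models can vary from fold to fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                           n_classes=3, n_clusters_per_class=1, random_state=0)

models = {
    'default': DecisionTreeClassifier(random_state=156),
    'min_samples_leaf=6': DecisionTreeClassifier(min_samples_leaf=6, random_state=156),
}

results = {}
for name, model in models.items():
    # Each fold is scored on samples the tree did not see during fitting
    scores = cross_val_score(model, X_features, y_labels, cv=5)
    results[name] = scores.mean()
    print(f'{name}: mean held-out accuracy = {scores.mean():.3f}')
```

Comparing these held-out scores with the training accuracy of each model shows how much of the unconstrained tree's apparent performance comes from fitting the training set itself.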