In [ ]:
!pip install fancyimpute
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer, KNNImputer
from fancyimpute import IterativeImputer
Creating sample data with missing values¶
In [ ]:
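np.random.seed(0)  # fix the seed so the sample is reproducible
# 100 draws from a normal distribution with mean 1 and standard deviation 10
sample_data = np.random.normal(1, 10, 100)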
sample_data
Out[ ]:
array([ 18.64052346, 5.00157208, 10.78737984, 23.40893199,
19.6755799 , -8.7727788 , 10.50088418, -0.51357208,
-0.03218852, 5.10598502, 2.44043571, 15.54273507,
8.61037725, 2.21675016, 5.43863233, 4.33674327,
15.94079073, -1.05158264, 4.13067702, -7.54095739,
-24.52989816, 7.53618595, 9.64436199, -6.4216502 ,
23.69754624, -13.54365675, 1.45758517, -0.8718385 ,
16.32779214, 15.6935877 , 2.54947426, 4.7816252 ,
-7.87785748, -18.80796468, -2.47912149, 2.56348969,
13.30290681, 13.02379849, -2.87326817, -2.02302751,
-9.48552965, -13.20017937, -16.06270191, 20.50775395,
-4.09652182, -3.38074302, -11.5279536 , 8.77490356,
-15.13897848, -1.1274028 , -7.95466561, 4.86902498,
-4.10805138, -10.80632184, 0.71817772, 5.28331871,
1.66517222, 4.02471898, -5.34322094, -2.62741166,
-5.72460448, -2.59553162, -7.13146282, -16.26282602,
2.77426142, -3.01780936, -15.30198347, 5.62782256,
-8.07298364, 1.51945396, 8.29090562, 2.28982911,
12.39400685, -11.3482582 , 5.02341641, -5.84810091,
-7.70797149, -4.78849665, -2.11552532, 1.56165342,
-10.65149841, 10.00826487, 5.6566244 , -14.36243686,
15.88252194, 19.95889176, 12.78779571, -0.79924836,
-9.70752622, 11.54451727, -3.03176947, 13.2244507 ,
3.08274978, 10.76639036, 4.56366397, 8.06573168,
1.10500021, 18.85870494, 2.26912093, 5.01989363])
In [ ]:
df = pd.DataFrame(sample_data, columns=['feature'])
# .loc slices by label and includes both endpoints, so 21 + 6 + 6 = 33 values become NaN
df.loc[10:30, 'feature'] = np.nan
df.loc[55:60, 'feature'] = np.nan
df.loc[80:85, 'feature'] = np.nan
In [ ]:
plt.figure(figsize=(10,4))
sns.lineplot(data=df, marker='o')
plt.title('Original')
plt.show()
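Linear (first-order) interpolation¶
- Fill missing values with the interpolate function
- Besides the default linear interpolation, the method argument selects other interpolation methods
- (method='polynomial', order=2) performs second-order polynomial interpolation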
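To make the difference between the two settings concrete, here is a minimal sketch on a hypothetical toy series (not the post's data) that follows y = x² with one value missing. It assumes SciPy is installed, which pandas needs for `method='polynomial'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: y = x**2 sampled at x = 0..4, with x = 3 missing.
s = pd.Series([0.0, 1.0, 4.0, np.nan, 16.0])

# Linear interpolation draws a straight line between the neighbours 4 and 16.
linear = s.interpolate(method='linear')

# Second-order polynomial interpolation (needs SciPy) follows the curvature.
poly2 = s.interpolate(method='polynomial', order=2)

print(linear[3])  # midpoint of 4 and 16, i.e. 10.0
print(poly2[3])   # close to the true value 3**2 = 9
```

On curved data the linear fill overshoots, while the second-order fill recovers the underlying shape. On noisy data such as the sample above, higher orders can swing more widely.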
In [ ]:
df_linear = df.interpolate(method='linear')
Quadratic (second-order) interpolation¶
In [ ]:
df_quadratic = df.interpolate(method='quadratic')
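Mean imputation¶
- The mean is a measure of central tendency, but because it reflects every observation it is strongly affected by outliers.
- Because it relies on the mean, it can only be used on numeric variables
- The strategy argument selects the imputation method
  - mean
  - median
  - most_frequent (the mode; it works on both numeric and string data, and is typically used for categorical variables)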
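The effect of outliers on each strategy can be seen in a small sketch. The arrays below are hypothetical toy inputs chosen for illustration, not the post's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one outlier (100).
X = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(X)
median_filled = SimpleImputer(strategy='median').fit_transform(X)

print(mean_filled[2, 0])    # 26.5, pulled up by the outlier
print(median_filled[2, 0])  # 2.5, unaffected by the outlier

# most_frequent also accepts numeric input: here 1.0 appears twice.
X_mode = np.array([[1.0], [1.0], [np.nan], [2.0]])
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(X_mode)
print(mode_filled[2, 0])    # 1.0
```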
In [ ]:
imputer_mean = SimpleImputer(strategy='mean')
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_mean = imputer_mean.fit_transform(df)
df_mean = pd.DataFrame(df_mean, columns=['feature'])
In [ ]:
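# Use separate names so the mean-imputed result (df_mean) is not overwritten
imputer_mode = SimpleImputer(strategy='most_frequent')
df_mode = imputer_mode.fit_transform(df)
df_mode = pd.DataFrame(df_mode, columns=['feature'])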
Replacing missing values with 0 using fillna()¶
- Any value you choose can be passed to fillna()
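As a small sketch of what "any value" can mean in practice, the hypothetical toy series below is filled with a constant, with a statistic of the observed values, and by carrying the previous value forward.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical toy series

zero_filled = s.fillna(0)         # a constant
mean_filled = s.fillna(s.mean())  # a statistic of the observed values (NaN is skipped)
forward_filled = s.ffill()        # carry the last observed value forward

print(zero_filled.tolist())     # [1.0, 0.0, 3.0]
print(mean_filled.tolist())     # [1.0, 2.0, 3.0]
print(forward_filled.tolist())  # [1.0, 1.0, 3.0]
```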
In [ ]:
df_zero = df.fillna(0)
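KNN method¶
- What is KNN?
  - In pattern recognition, the k-nearest neighbors algorithm is a non-parametric method used for classification and regression
- KNNImputer fills each missing value with the uniform or distance-weighted mean of that feature over the chosen number of nearest neighbors
  - → with n_neighbors = 2, each missing value is filled with a value approximated from its two nearest neighbors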
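The neighbor search uses the other features of each row, so a sketch with several columns shows the behavior more clearly. The first matrix below mirrors the small example in the scikit-learn documentation. The second, single-column array is a hypothetical input added to illustrate a caveat: as far as I understand scikit-learn's implementation, a row with no observed feature has no usable distances, so the imputer falls back to the column mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Three features: neighbours are found from the columns that are observed.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled)

# Hypothetical single-column case: there is nothing to measure distance on,
# so the missing entry is expected to receive the column mean.
single = np.array([[1.0], [np.nan], [3.0]])
single_filled = KNNImputer(n_neighbors=2).fit_transform(single)
print(single_filled.ravel())
```

If that fallback applies, KNN imputation on a one-column frame such as the `feature` column above would behave like mean imputation, and the method becomes more informative once additional columns are available.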
In [ ]:
imputer_knn = KNNImputer(n_neighbors = 2)
df_knn = imputer_knn.fit_transform(df)
df_knn = pd.DataFrame(df_knn, columns=['feature'])
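MICE multiple imputation¶
- Because it fills missing values by regressing on the other variables, it is most often used on numeric variables
- Categorical variables can be used once they have been encoded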
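The regression idea is easier to see with two related columns. The sketch below uses scikit-learn's own `IterativeImputer` rather than the fancyimpute import (an assumption made to keep the example self-contained) and follows the small example from the scikit-learn user guide, where the second column is roughly twice the first.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Training data in which column 2 is about 2 * column 1.
X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)

# Each missing entry is predicted from the other column.
X_new = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
result = np.round(imputer.transform(X_new))
print(result)
```

With only a single column there are no other variables to regress on, so this approach pays off mainly on multi-column data.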
In [ ]:
mice_imputer = IterativeImputer()
df_mice = mice_imputer.fit_transform(df)
df_mice = pd.DataFrame(df_mice, columns=['feature'])
In [ ]:
df_quadratic[5:10]
Out[ ]:
| | feature |
|---|---|
| 5 | -8.772779 |
| 6 | 10.500884 |
| 7 | -0.513572 |
| 8 | -0.032189 |
| 9 | 5.105985 |
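- Inspecting missing values requires domain knowledge about the data
  - → the fact that a value is missing may itself be meaningful
- Handling missing values by deleting them carries risk
  - → it can lead to loss of data
Using visualization to check how each method transforms the data¶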
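To put a number on the risk of deletion, the sketch below rebuilds the sample frame from this post and counts how many rows `dropna()` would discard. The `feature_was_missing` column name is a hypothetical choice, shown as one way to keep the missingness information before imputing.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data used in this post.
np.random.seed(0)
df = pd.DataFrame(np.random.normal(1, 10, 100), columns=['feature'])
df.loc[10:30, 'feature'] = np.nan
df.loc[55:60, 'feature'] = np.nan
df.loc[80:85, 'feature'] = np.nan

n_missing = int(df['feature'].isna().sum())
n_after_drop = len(df.dropna())
print(n_missing, n_after_drop)  # 33 67

# Keep a flag so later analysis can still tell which rows were originally missing.
df['feature_was_missing'] = df['feature'].isna()
```

Dropping rows here would remove a third of the observations, which is why imputation is usually considered first.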
In [ ]:
fig, axs = plt.subplots(7,1,figsize=(10,12))
sns.lineplot(data= df, marker='o', ax=axs[0], legend='auto')
axs[0].set_title('Original Data')
sns.lineplot(data=df_linear , marker='o', ax=axs[1], legend='auto')
axs[1].set_title('Linear Data')
sns.lineplot(data=df_quadratic , marker='o', ax=axs[2], legend='auto')
axs[2].set_title('Quadratic Data')
sns.lineplot(data=df_mean , marker='o', ax=axs[3], legend='auto')
axs[3].set_title('Simple Mean')
sns.lineplot(data=df_zero , marker='o', ax=axs[4], legend='auto')
axs[4].set_title('Zero')
sns.lineplot(data=df_knn , marker='o', ax=axs[5], legend='auto')
axs[5].set_title('df_knn')
sns.lineplot(data=df_mice , marker='o', ax=axs[6], legend='auto')
axs[6].set_title('df_mice')
Out[ ]:
Text(0.5, 1.0, 'df_mice')
- After imputing the missing values, the methods can be compared by their effect on performance during data analysis and modeling
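One possible way to run such a comparison, sketched under assumptions that go beyond the post: hide values whose true answer is known, impute them with each method, and score the fills with RMSE. The sine signal, the masked range, and the `rmse` helper are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical smooth signal whose true values are known everywhere.
x = np.linspace(0, 2 * np.pi, 100)
truth = pd.Series(np.sin(x))

# Hide a block of values to simulate missing data.
masked = truth.copy()
masked.iloc[20:30] = np.nan
hidden = masked.isna()

candidates = {
    'linear': masked.interpolate(method='linear'),
    'mean': masked.fillna(masked.mean()),
    'zero': masked.fillna(0),
}

def rmse(filled):
    """Root mean squared error on the hidden positions only."""
    return float(np.sqrt(((filled[hidden] - truth[hidden]) ** 2).mean()))

scores = {name: rmse(filled) for name, filled in candidates.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f'{name}: {score:.3f}')
```

On a smooth signal like this one, linear interpolation scores best. On noisy data such as the random sample in this post the ranking may differ, so the check is worth repeating on data that resembles the real use case.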