감귤 착과량 예측 AI 경진대회

https://dacon.io/competitions/official/236038/overview/description

ID : 과수나무 고유 ID
착과량(int) : 실제 감귤 착과량 (Target)
나무 생육 상태 Features (5개): 수고(m), 수관폭1(min), 수관폭2(max), 수관폭평균(수관폭1과 수관폭2의 평균) (데이터 기입은 cm 단위)
새순 Features (89개): 2022년 09월 01일 ~ 2022년 11월 28일에 일별 측정된 새순 데이터
엽록소 Features (89개): 2022년 09월 01일 ~ 2022년 11월 28일에 일별 측정된 엽록소 데이터

목적: 감귤나무의 나무 생육 상태, 엽록소 및 새순 정보로부터 감귤 착과량을 회귀 예측
시간이 경과할 수록 새순 피처의 변화, 엽록소 피처의 변화를 따로 봐야할 것 같음
감귤 착과량에 가장 영향을 많이 주는 피처를 찾기/나무 생육 상태 5가지에 따른 착과량 비교하기

In [1]:

!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 46 not upgraded.
Need to get 9,599 kB of archives.
After this operation, 29.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 fonts-nanum all 20180306-3 [9,599 kB]
Fetched 9,599 kB in 0s (54.6 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package fonts-nanum.
(Reading database ... 122541 files and directories currently installed.)
Preparing to unpack .../fonts-nanum_20180306-3_all.deb ...
Unpacking fonts-nanum (20180306-3) ...
Setting up fonts-nanum (20180306-3) ...
Processing triggers for fontconfig (2.13.1-2ubuntu3) ...
/usr/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/usr/share/fonts/truetype: caching, new cache contents: 0 fonts, 3 dirs
/usr/share/fonts/truetype/humor-sans: caching, new cache contents: 1 fonts, 0 dirs
/usr/share/fonts/truetype/liberation: caching, new cache contents: 16 fonts, 0 dirs
/usr/share/fonts/truetype/nanum: caching, new cache contents: 10 fonts, 0 dirs
/usr/local/share/fonts: caching, new cache contents: 0 fonts, 0 dirs
/root/.local/share/fonts: skipping, no such directory
/root/.fonts: skipping, no such directory
/usr/share/fonts/truetype: skipping, looped directory detected
/usr/share/fonts/truetype/humor-sans: skipping, looped directory detected
/usr/share/fonts/truetype/liberation: skipping, looped directory detected
/usr/share/fonts/truetype/nanum: skipping, looped directory detected
/var/cache/fontconfig: cleaning cache directory
/root/.cache/fontconfig: not cleaning non-existent cache directory
/root/.fontconfig: not cleaning non-existent cache directory
fc-cache: succeeded

In [11]:

import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
import matplotlib.pyplot as plt

plt.rc('font', family='NanumBarunGothic')

간단 EDA

In [13]:

train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

In [14]:

train.head()

Out[14]:

	ID	착과량(int)	수고(m)	수관폭1(min)	수관폭2(max)	수관폭평균	2022-09-01 새순	2022-09-02 새순	2022-09-03 새순	2022-09-04 새순	...	2022-11-19 엽록소	2022-11-20 엽록소	2022-11-21 엽록소	2022-11-22 엽록소	2022-11-23 엽록소	2022-11-24 엽록소	2022-11-25 엽록소	2022-11-26 엽록소	2022-11-27 엽록소	2022-11-28 엽록소
0	TRAIN_0000	692	275.0	287.0	292.0	289.5	2.8	2.8	2.7	2.7	...	70.978249	70.876794	70.705253	70.559603	70.427356	70.340491	70.293830	70.262422	70.169841	70.043251
1	TRAIN_0001	534	293.0	284.0	336.0	310.0	3.3	3.3	3.3	3.2	...	71.535483	71.382303	71.253604	71.092665	70.955608	70.796630	70.597550	70.565088	70.560502	70.427600
2	TRAIN_0002	634	300.0	392.0	450.0	421.0	3.0	2.9	2.9	2.9	...	71.279804	71.199570	71.144020	71.026740	70.920038	70.876723	70.710129	70.595971	70.418203	70.399578
3	TRAIN_0003	639	289.0	368.0	379.0	373.5	3.1	3.0	3.0	3.0	...	69.934615	69.884124	69.845683	69.794682	69.779813	69.614644	69.455404	69.281590	69.238689	69.133970
4	TRAIN_0004	496	306.0	353.0	358.0	355.5	3.7	3.6	3.6	3.6	...	68.313016	68.285364	68.209860	68.209458	68.040083	67.859963	67.775556	67.701456	67.504244	67.410093

5 rows × 184 columns

In [15]:

test.head()

Out[15]:

	ID	수고(m)	수관폭1(min)	수관폭2(max)	수관폭평균	2022-09-01 새순	2022-09-02 새순	2022-09-03 새순	2022-09-04 새순	2022-09-05 새순	...	2022-11-19 엽록소	2022-11-20 엽록소	2022-11-21 엽록소	2022-11-22 엽록소	2022-11-23 엽록소	2022-11-24 엽록소	2022-11-25 엽록소	2022-11-26 엽록소	2022-11-27 엽록소	2022-11-28 엽록소
0	TEST_0000	231.0	219.0	302.0	260.5	4.3	4.2	4.2	4.1	4.1	...	72.091497	71.906176	71.886799	71.858720	71.725540	71.531618	71.476935	71.321237	71.184295	71.000223
1	TEST_0001	206.0	324.0	346.0	335.0	2.5	2.5	2.5	2.5	2.5	...	68.423776	68.353200	68.346777	68.188159	67.995233	67.879958	67.744820	67.673522	67.539973	67.447849
2	TEST_0002	282.0	373.0	404.0	388.5	4.7	4.7	4.6	4.6	4.5	...	74.435783	74.247468	74.142664	74.070401	73.892408	73.796143	73.615838	73.440136	73.387051	73.228319
3	TEST_0003	234.0	337.0	357.0	347.0	3.6	3.6	3.6	3.6	3.5	...	73.570735	73.420243	73.313538	73.149298	73.109599	72.924511	72.847411	72.838263	72.672156	72.538171
4	TEST_0004	222.0	329.0	347.0	338.0	2.7	2.7	2.7	2.7	2.6	...	72.216758	72.163153	72.052952	71.905982	71.766141	71.746813	71.668847	71.501043	71.419715	71.328224

5 rows × 183 columns

In [16]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Columns: 184 entries, ID to 2022-11-28 엽록소
dtypes: float64(182), int64(1), object(1)
memory usage: 3.1+ MB

In [17]:

train.isna().sum()

Out[17]:

ID                0
착과량(int)          0
수고(m)             0
수관폭1(min)         0
수관폭2(max)         0
                 ..
2022-11-24 엽록소    0
2022-11-25 엽록소    0
2022-11-26 엽록소    0
2022-11-27 엽록소    0
2022-11-28 엽록소    0
Length: 184, dtype: int64

In [18]:

test.isna().sum()

Out[18]:

ID                0
수고(m)             0
수관폭1(min)         0
수관폭2(max)         0
수관폭평균             0
                 ..
2022-11-24 엽록소    0
2022-11-25 엽록소    0
2022-11-26 엽록소    0
2022-11-27 엽록소    0
2022-11-28 엽록소    0
Length: 183, dtype: int64

In [19]:

train.describe()

Out[19]:

	착과량(int)	수고(m)	수관폭1(min)	수관폭2(max)	수관폭평균	2022-09-01 새순	2022-09-02 새순	2022-09-03 새순	2022-09-04 새순	2022-09-05 새순	...	2022-11-19 엽록소	2022-11-20 엽록소	2022-11-21 엽록소	2022-11-22 엽록소	2022-11-23 엽록소	2022-11-24 엽록소	2022-11-25 엽록소	2022-11-26 엽록소	2022-11-27 엽록소	2022-11-28 엽록소
count	2207.000000	2207.00000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	...	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000	2207.000000
mean	406.222474	243.72995	315.186226	358.824196	337.005211	3.774807	3.736928	3.701495	3.663435	3.627503	...	68.992036	68.890661	68.791536	68.692934	68.593709	68.493569	68.393050	68.291350	68.192009	68.093096
std	218.978302	35.88914	66.327994	68.862898	64.859684	0.774181	0.764067	0.754751	0.745394	0.733896	...	4.124586	4.124429	4.125817	4.128586	4.130005	4.130072	4.131859	4.134976	4.135591	4.136995
min	1.000000	95.00000	95.000000	123.000000	109.000000	2.300000	2.200000	2.200000	2.200000	2.200000	...	54.246243	54.154888	54.143798	54.028089	53.884892	53.695860	53.649032	53.644994	53.450069	53.379849
25%	221.000000	223.00000	271.500000	313.500000	294.000000	3.100000	3.100000	3.050000	3.000000	3.000000	...	66.195524	66.075372	65.995234	65.906703	65.765257	65.691065	65.609606	65.487863	65.374528	65.278624
50%	404.000000	243.00000	316.000000	364.000000	340.500000	3.800000	3.700000	3.700000	3.700000	3.600000	...	69.048956	68.920295	68.841060	68.731164	68.647716	68.514740	68.392436	68.292889	68.207648	68.102235
75%	593.500000	267.00000	361.000000	402.000000	380.000000	4.400000	4.400000	4.300000	4.300000	4.300000	...	71.740281	71.615746	71.504264	71.390195	71.311027	71.227033	71.152030	71.038257	70.947707	70.850934
max	799.000000	381.00000	559.000000	584.000000	571.500000	5.300000	5.300000	5.200000	5.200000	5.100000	...	83.363883	83.227379	83.036103	82.978876	82.842139	82.722266	82.575149	82.491041	82.371832	82.321938

8 rows × 183 columns

In [20]:

growth = train[['착과량(int)', '수고(m)', '수관폭1(min)', '수관폭2(max)', '수관폭평균']]
growth

Out[20]:

	착과량(int)	수고(m)	수관폭1(min)	수관폭2(max)	수관폭평균
0	692	275.0	287.0	292.0	289.5
1	534	293.0	284.0	336.0	310.0
2	634	300.0	392.0	450.0	421.0
3	639	289.0	368.0	379.0	373.5
4	496	306.0	353.0	358.0	355.5
...	...	...	...	...	...
2202	119	230.0	353.0	367.0	360.0
2203	611	266.0	337.0	400.0	368.5
2204	650	237.0	353.0	362.0	357.5
2205	363	281.0	337.0	403.0	370.0
2206	17	228.0	389.0	398.0	393.5

2207 rows × 5 columns

In [21]:

growth.corr()

Out[21]:

	착과량(int)	수고(m)	수관폭1(min)	수관폭2(max)	수관폭평균
착과량(int)	1.000000	0.007002	0.021279	0.037168	0.030611
수고(m)	0.007002	1.000000	0.456394	0.484295	0.490456
수관폭1(min)	0.021279	0.456394	1.000000	0.841330	0.957948
수관폭2(max)	0.037168	0.484295	0.841330	1.000000	0.961049
수관폭평균	0.030611	0.490456	0.957948	0.961049	1.000000

In [22]:

sns.heatmap(growth.corr(), cmap='Greens', annot=True)

Out[22]:

<Axes: >

나무 생육 상태 4가지와 착과량은 상관성이 없음

In [23]:

saesun = train.loc[:, '2022-09-01 새순':'2022-11-28 새순']
saesun

Out[23]:

	2022-09-01 새순	2022-09-02 새순	2022-09-03 새순	2022-09-04 새순	2022-09-05 새순	2022-09-06 새순	2022-09-07 새순	2022-09-08 새순	2022-09-09 새순	2022-09-10 새순	...	2022-11-19 새순	2022-11-20 새순	2022-11-21 새순	2022-11-22 새순	2022-11-23 새순	2022-11-24 새순	2022-11-25 새순	2022-11-26 새순	2022-11-27 새순	2022-11-28 새순
0	2.8	2.8	2.7	2.7	2.7	2.7	2.6	2.6	2.6	2.6	...	0.9	0.8	0.8	0.8	0.7	0.7	0.7	0.7	0.7	0.6
1	3.3	3.3	3.3	3.2	3.2	3.1	3.1	3.1	3.0	3.0	...	0.7	0.7	0.6	0.6	0.6	0.5	0.0	0.0	0.0	0.0
2	3.0	2.9	2.9	2.9	2.9	2.8	2.8	2.8	2.8	2.7	...	0.8	0.8	0.7	0.7	0.7	0.7	0.6	0.6	0.6	0.5
3	3.1	3.0	3.0	3.0	3.0	2.9	2.9	2.9	2.8	2.8	...	0.9	0.9	0.9	0.8	0.8	0.8	0.8	0.7	0.7	0.7
4	3.7	3.6	3.6	3.6	3.5	3.5	3.5	3.4	3.4	3.3	...	0.9	0.9	0.8	0.8	0.8	0.7	0.7	0.7	0.6	0.6
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2202	4.7	4.6	4.5	4.5	4.4	4.4	4.3	4.3	4.2	4.2	...	0.8	0.8	0.8	0.7	0.7	0.6	0.6	0.5	0.0	0.0
2203	3.1	3.0	3.0	3.0	2.9	2.9	2.9	2.9	2.8	2.8	...	0.9	0.9	0.8	0.8	0.8	0.7	0.7	0.7	0.7	0.6
2204	3.0	3.0	2.9	2.9	2.9	2.9	2.8	2.8	2.8	2.8	...	0.8	0.8	0.7	0.7	0.7	0.7	0.6	0.6	0.6	0.5
2205	3.9	3.9	3.8	3.8	3.8	3.7	3.7	3.6	3.6	3.6	...	0.8	0.8	0.8	0.7	0.7	0.7	0.6	0.6	0.5	0.0
2206	5.3	5.2	5.2	5.1	5.0	5.0	4.9	4.9	4.8	4.8	...	0.9	0.8	0.8	0.7	0.7	0.6	0.6	0.5	0.0	0.0

2207 rows × 89 columns

In [24]:

saesun.mean()

Out[24]:

2022-09-01 새순    3.774807
2022-09-02 새순    3.736928
2022-09-03 새순    3.701495
2022-09-04 새순    3.663435
2022-09-05 새순    3.627503
                   ...   
2022-11-24 새순    0.682102
2022-11-25 새순    0.632170
2022-11-26 새순    0.559991
2022-11-27 새순    0.454599
2022-11-28 새순    0.352832
Length: 89, dtype: float64

시간이 지날수록 새순 개수가 줄어듬

In [25]:

yeoprokso = train.loc[:, '2022-09-01 엽록소':'2022-11-28 엽록소']
yeoprokso

Out[25]:

	2022-09-01 엽록소	2022-09-02 엽록소	2022-09-03 엽록소	2022-09-04 엽록소	2022-09-05 엽록소	2022-09-06 엽록소	2022-09-07 엽록소	2022-09-08 엽록소	2022-09-09 엽록소	2022-09-10 엽록소	...	2022-11-19 엽록소	2022-11-20 엽록소	2022-11-21 엽록소	2022-11-22 엽록소	2022-11-23 엽록소	2022-11-24 엽록소	2022-11-25 엽록소	2022-11-26 엽록소	2022-11-27 엽록소	2022-11-28 엽록소
0	78.336504	78.243462	78.166501	78.127526	78.058021	77.915689	77.861660	77.746645	77.592138	77.572477	...	70.978249	70.876794	70.705253	70.559603	70.427356	70.340491	70.293830	70.262422	70.169841	70.043251
1	80.116691	79.974397	79.890399	79.709278	79.531061	79.348139	79.151023	79.128739	79.084197	78.924476	...	71.535483	71.382303	71.253604	71.092665	70.955608	70.796630	70.597550	70.565088	70.560502	70.427600
2	79.118529	78.941452	78.928197	78.741878	78.713691	78.654247	78.461743	78.323264	78.219696	78.123820	...	71.279804	71.199570	71.144020	71.026740	70.920038	70.876723	70.710129	70.595971	70.418203	70.399578
3	77.936262	77.839523	77.815607	77.644644	77.546555	77.480441	77.383665	77.282036	77.126354	77.121300	...	69.934615	69.884124	69.845683	69.794682	69.779813	69.614644	69.455404	69.281590	69.238689	69.133970
4	76.233231	76.186572	76.031032	75.886553	75.736449	75.735036	75.642223	75.458496	75.380945	75.272054	...	68.313016	68.285364	68.209860	68.209458	68.040083	67.859963	67.775556	67.701456	67.504244	67.410093
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2202	74.024614	73.876933	73.853205	73.721799	73.666974	73.540144	73.449933	73.349051	73.310852	73.273704	...	66.347384	66.293609	66.193131	66.176590	66.109845	66.008775	65.889675	65.734066	65.586656	65.560547
2203	81.920876	81.898801	81.769536	81.607251	81.456723	81.408951	81.312932	81.181893	81.027196	81.010124	...	73.438844	73.432194	73.355195	73.329491	73.271361	73.088750	73.071705	73.069036	72.952559	72.944396
2204	75.486518	75.451905	75.315223	75.252547	75.186170	75.099348	75.093451	75.050447	74.877088	74.711044	...	67.576066	67.404347	67.394009	67.247428	67.117025	66.933771	66.734617	66.572095	66.506006	66.479459
2205	74.110017	74.017373	73.886675	73.720486	73.712709	73.532512	73.352104	73.230786	73.183439	73.006241	...	66.766379	66.602606	66.510781	66.334637	66.323613	66.248731	66.202532	66.108962	65.934979	65.812084
2206	80.830756	80.687949	80.517089	80.339555	80.157257	79.970405	79.781988	79.764210	79.724992	79.591028	...	72.885385	72.734930	72.629823	72.454654	72.389570	72.271680	72.208433	72.070977	71.872453	71.870512

2207 rows × 89 columns

In [26]:

yeoprokso.mean()

Out[26]:

2022-09-01 엽록소    76.899788
2022-09-02 엽록소    76.801492
2022-09-03 엽록소    76.699545
2022-09-04 엽록소    76.599512
2022-09-05 엽록소    76.500475
                    ...    
2022-11-24 엽록소    68.493569
2022-11-25 엽록소    68.393050
2022-11-26 엽록소    68.291350
2022-11-27 엽록소    68.192009
2022-11-28 엽록소    68.093096
Length: 89, dtype: float64

시간이 지날수록 엽록소도 줄어들지만, 급격한 변화는 아님

여기서부터 https://dacon.io/competitions/official/236038/codeshare/7298?page=1&dtype=recent 참고

전처리

NMAE(Normalized Mean Absolute Error): 시계열 데이터일 때 많이 이용하는 평가산식
- MAE(Mean Absolute Error): 예측값과 실제값간의 차이에 절댓값을 취하고 이를 평균 낸 지표. 값이 작을수록 잘 예측한 것
정규화된 MAE이기 때문에 편향되지 않아 scale이 다른 데이터셋의 MAE 비교에 용이함

NMAE 평가를 위한 코드

In [27]:

def NMAE(true, pred):
  mae = np.mean(np.abs(true-pred))
  score = mae / np.mean(np.abs(true))
  return score

파생변수 생성

period_i_새순 : 9일의 새순 평균 컬럼

새순 diff : 새순 max 컬럼 - 새순 min 컬럼

새순 max 컬럼

새순 min 컬럼

나무 부피 컬럼 : 수고(m) x 수관폭평균

엽록소 컬럼의 경우 target 값과의 연관성이 부족하다 생각해 all drop

In [28]:

def reduce_col(df, trainortest):
  saesun = train.loc[:, '2022-09-01 새순':'2022-11-28 새순']
  yeoprokso = train.loc[:, '2022-09-01 엽록소':'2022-11-28 엽록소']

  sarr = np.array(saesun.columns)
  sarr = np.append(sarr, 'd').reshape(10, 9)  #새순데이터가 89개이므로 1개를 임의로 추가해서 90개 맞춰준 후 reshape하기
  sdate_dic = {'period_1 새순':sarr[0],
               'period_2 새순':sarr[1],
               'period_3 새순':sarr[2],
               'period_4 새순':sarr[3],
               'period_5 새순':sarr[4],
               'period_6 새순':sarr[5],
               'period_7 새순':sarr[6],
               'period_8 새순':sarr[7],
               'period_9 새순':sarr[8],
               'period_10 새순':sarr[9][:-1]
               }

  new = pd.DataFrame()
  for period, dates in sdate_dic.items():
    new[period] = saesun[dates].mean(axis=1)

  new['새순 diff'] = train['2022-09-01 새순']-df['2022-11-28 새순']
  new['새순 max'] = train['2022-09-01 새순']
  new['새순 min'] = train['2022-11-28 새순']

  if trainortest == 'train':
    new['착과량'] = train['착과량(int)']

  new['나무부피'] = train['수고(m)'] * train['수관폭평균']

  return new

데이터셋의 앞서 생성한 함수 적용하고 X, y에 간추린 피처들만 할당

In [29]:

X = reduce_col(train, 'train').drop('착과량', axis=1)
y = reduce_col(train, 'train')['착과량']

In [30]:

X.columns

Out[30]:

Index(['period_1 새순', 'period_2 새순', 'period_3 새순', 'period_4 새순',
       'period_5 새순', 'period_6 새순', 'period_7 새순', 'period_8 새순',
       'period_9 새순', 'period_10 새순', '새순 diff', '새순 max', '새순 min', '나무부피'],
      dtype='object')

모델링(1) RandomForestRegressor

그리드서치

In [32]:

from sklearn.model_selection import ShuffleSplit
#임의분할 교차검정, 사전에 설정한 비율에 따라 데이터셋 인덱스를 무작위로  분할
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_validate
import time

start = time.time()

rf = RandomForestRegressor(n_jobs = -1, random_state=0)
param_grid = {'max_depth':[3, 4, 5, 6, 7,],
              'bootstrap': [True, False],
              'max_features': ['auto', 'sqrt', 'log2'],
              'criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson']}

ss = ShuffleSplit(test_size=0.3, random_state=0)
grid_rf_cv = GridSearchCV(rf, return_train_score=True,
                          param_grid=param_grid,
                          cv=ss,
                          verbose=0, scoring='neg_mean_absolute_error',
                          n_jobs=-1)
grid_rf_cv.fit(X, y)

end = time.time()
print('수행 시간: {0:.3f}'.format(end-start))
print('최적 매개변수 조합:', grid_rf_cv.best_params_)
print('최고의 교차 검증 점수:', grid_rf_cv.best_score_)

수행 시간: 1542.643
최적 매개변수 조합: {'bootstrap': False, 'criterion': 'friedman_mse', 'max_depth': 6, 'max_features': 'sqrt'}
최고의 교차 검증 점수: -30.18017600138905

최적 모델을 best_rf에 저장하고 NMAE 확인

In [33]:

best_rf = grid_rf_cv.best_estimator_
y_pred_rf = best_rf.predict(X)
print(f'NMAE: {NMAE(y, y_pred_rf)}')

NMAE: 0.06610512938333356

test data로 예측하기

In [34]:

X_test = reduce_col(test, 'test')
test_pred_rf = best_rf.predict(X_test)

In [35]:

test_pred_rf

Out[35]:

array([645.19767757, 551.3456464 , 639.02808225, ..., 639.0451265 ,
       360.25043723,  24.2362344 ])

모델링(2) XGBoostRegressor

In [36]:

import xgboost as xgb

그리드서치

In [37]:

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

xgbr = xgb.XGBRegressor()
param_grid = {'learning_rate': [0.001, 0.01, 0.1, 0.2],
              'max_depth': [3 ,4 ,5, 6, 7]}

ss = ShuffleSplit(test_size=0.3, random_state=0)
grid_xgb_cv = GridSearchCV(xgbr, return_train_score=True,
                          param_grid=param_grid,
                          cv=ss,
                          verbose=0, scoring='neg_mean_absolute_error',
                          n_jobs=-1)
grid_xgb_cv.fit(X, y)

print('최적 매개변수 조합:', grid_xgb_cv.best_params_)
print('최고의 교차 검증 점수:', grid_xgb_cv.best_score_)

최적 매개변수 조합: {'learning_rate': 0.1, 'max_depth': 3}
최고의 교차 검증 점수: -30.630914427899672

최적 모델을 best_xgb에 저장하고 NMAE 확인

In [38]:

best_xgb = grid_xgb_cv.best_estimator_
y_pred_xgb = best_xgb.predict(X)
print(f'NMAE: {NMAE(y, y_pred_xgb)}')

NMAE: 0.06664120790058173

test data로 예측하기

In [40]:

test_pred_xgb = best_xgb.predict(X_test)

모델 앙상블

In [41]:

final_pred = 0.5 * test_pred_rf + 0.5 * test_pred_xgb

In [48]:

final_pred

Out[48]:

array([644.03145597, 548.1806357 , 636.13910216, ..., 631.92289284,
       360.16364025,  26.20572459])

교차 검증 및 변수 중요도 확인

랜덤포레스트 교차검증

In [49]:

from sklearn.model_selection import cross_validate
cv_scores = cross_validate(rf, X, y,
                           cv=5, return_train_score=True,
                           n_jobs=-1)

cv_scores_2 = cross_validate(best_rf, X, y,
                             cv=5, return_train_score=True,
                             n_jobs=-1)

print('첫 번째 교차 검증 결과')
print('훈련 세트 점수: {0:.3f}'.format(np.mean(cv_scores['train_score'])))
print('검증 세트 점수: {0:.3f}'.format(np.mean(cv_scores['test_score'])))
print('\n')
print('두 번째 교차 검증 결과')
print('훈련 세트 점수: {0:.3f}'.format(np.mean(cv_scores_2['train_score'])))
print('검증 세트 점수: {0:.3f}'.format(np.mean(cv_scores_2['test_score'])))

첫 번째 교차 검증 결과
훈련 세트 점수: 0.995
검증 세트 점수: 0.967


두 번째 교차 검증 결과
훈련 세트 점수: 0.979
검증 세트 점수: 0.970

XGBoost 교차검증

In [50]:

from sklearn.model_selection import cross_validate
cv_scores = cross_validate(xgbr, X, y,
                           cv=5, return_train_score=True,
                           n_jobs=-1)

cv_scores_2 = cross_validate(best_xgb, X, y,
                             cv=5, return_train_score=True,
                             n_jobs=-1)

print('첫 번째 교차 검증 결과')
print('훈련 세트 점수: {0:.3f}'.format(np.mean(cv_scores['train_score'])))
print('검증 세트 점수: {0:.3f}'.format(np.mean(cv_scores['test_score'])))
print('\n')
print('두 번째 교차 검증 결과')
print('훈련 세트 점수: {0:.3f}'.format(np.mean(cv_scores_2['train_score'])))
print('검증 세트 점수: {0:.3f}'.format(np.mean(cv_scores_2['test_score'])))

첫 번째 교차 검증 결과
훈련 세트 점수: 0.998
검증 세트 점수: 0.963


두 번째 교차 검증 결과
훈련 세트 점수: 0.978
검증 세트 점수: 0.970

랜덤포레스트 피처 중요도

In [51]:

importances = best_rf.feature_importances_

features = X.columns

new = pd.DataFrame({'피처': np.array(features), '중요도': importances})

new.sort_values(by='중요도', ascending=False)

Out[51]:

	피처	중요도
1	period_2 새순	0.206323
2	period_3 새순	0.203110
0	period_1 새순	0.155636
11	새순 max	0.147918
3	period_4 새순	0.094341
4	period_5 새순	0.062433
10	새순 diff	0.045945
6	period_7 새순	0.034997
5	period_6 새순	0.031448
7	period_8 새순	0.009795
12	새순 min	0.006348
9	period_10 새순	0.000857
8	period_9 새순	0.000478
13	나무부피	0.000371

XGBoost 피처 중요도

In [52]:

importances = best_xgb.feature_importances_

features = X.columns

new = pd.DataFrame({'피처': np.array(features), '중요도': importances})

new.sort_values(by='중요도', ascending=False)

Out[52]:

	피처	중요도
0	period_1 새순	0.558243
1	period_2 새순	0.244044
11	새순 max	0.082362
2	period_3 새순	0.047165
10	새순 diff	0.043126
3	period_4 새순	0.010819
12	새순 min	0.007358
4	period_5 새순	0.002545
6	period_7 새순	0.000885
9	period_10 새순	0.000765
8	period_9 새순	0.000737
13	나무부피	0.000658
5	period_6 새순	0.000652
7	period_8 새순	0.000642

[회귀] 감귤 착과량 예측 AI 경진대회