Data Science/Python Machine Learning Complete Guide
[Dacon] Boston House Price Prediction
얆생
2023. 4. 20. 22:01
https://dacon.io/competitions/open/236065/overview/description
I tried out one of Dacon's educational practice competitions.
- CRIM: per-capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxide concentration
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied homes built before 1940
- DIS: distance to five Boston employment centers
- RAD: index of accessibility to highways
- TAX: property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: proportion of Black residents by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes (in $1,000s)
In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols, glm

# Load the training data
house = pd.read_csv('./Boston Housing 데이터/train.csv')
house
Out[1]:
# | ID | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TRAIN_000 | 0.04819 | 80.0 | 3.64 | 0.0 | 0.392 | 6.108 | 32.0 | 9.2203 | 1.0 | 315.0 | 16.4 | 392.89 | 6.57 | 21.9 |
1 | TRAIN_001 | 1.42502 | 0.0 | 19.58 | 0.0 | 0.871 | 6.510 | 100.0 | 1.7659 | 5.0 | 403.0 | 14.7 | 364.31 | 7.39 | 23.3 |
2 | TRAIN_002 | 0.01778 | 95.0 | 1.47 | 0.0 | 0.403 | 7.135 | 13.9 | 7.6534 | 3.0 | 402.0 | 17.0 | 384.30 | 4.45 | 32.9 |
3 | TRAIN_003 | 9.51363 | 0.0 | 18.10 | 0.0 | 0.713 | 6.728 | 94.1 | 2.4961 | 24.0 | 666.0 | 20.2 | 6.68 | 18.71 | 14.9 |
4 | TRAIN_004 | 1.65660 | 0.0 | 19.58 | 0.0 | 0.871 | 6.122 | 97.3 | 1.6180 | 5.0 | 403.0 | 14.7 | 372.80 | 14.10 | 21.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
450 | TRAIN_450 | 0.33983 | 22.0 | 5.86 | 0.0 | 0.431 | 6.108 | 34.9 | 8.0555 | 7.0 | 330.0 | 19.1 | 390.18 | 9.16 | 24.3 |
451 | TRAIN_451 | 3.83684 | 0.0 | 18.10 | 0.0 | 0.770 | 6.251 | 91.1 | 2.2955 | 24.0 | 666.0 | 20.2 | 350.65 | 14.19 | 19.9 |
452 | TRAIN_452 | 0.20746 | 0.0 | 27.74 | 0.0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4.0 | 711.0 | 20.1 | 318.43 | 29.68 | 8.1 |
453 | TRAIN_453 | 0.09299 | 0.0 | 25.65 | 0.0 | 0.581 | 5.961 | 92.9 | 2.0869 | 2.0 | 188.0 | 19.1 | 378.09 | 17.93 | 20.5 |
454 | TRAIN_454 | 25.94060 | 0.0 | 18.10 | 0.0 | 0.679 | 5.304 | 89.1 | 1.6475 | 24.0 | 666.0 | 20.2 | 127.36 | 26.64 | 10.4 |
455 rows × 15 columns
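Before going further, it is worth a quick check that the column types match the data dictionary and that nothing is missing. A minimal sketch, not part of the original run:
# Quick schema check: dtypes and per-column missing-value counts
print(house.dtypes)
print(house.isnull().sum())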
In [2]:
house.describe()
Out[2]:
stat | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 | 455.000000 |
mean | 3.659998 | 11.165934 | 11.092747 | 0.065934 | 0.555160 | 6.285352 | 68.320440 | 3.788439 | 9.551648 | 407.892308 | 18.453846 | 356.285714 | 12.604967 | 22.574725 |
std | 8.888075 | 22.953757 | 6.821697 | 0.248440 | 0.116524 | 0.699346 | 27.893807 | 2.082031 | 8.718624 | 169.446805 | 2.149137 | 92.664335 | 7.090388 | 9.120759 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.389000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.079230 | 0.000000 | 5.160000 | 0.000000 | 0.449000 | 5.881000 | 44.700000 | 2.100350 | 4.000000 | 278.000000 | 17.400000 | 375.990000 | 6.970000 | 17.100000 |
50% | 0.245220 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208000 | 76.700000 | 3.199200 | 5.000000 | 330.000000 | 19.000000 | 391.700000 | 11.280000 | 21.200000 |
75% | 3.685665 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.627000 | 93.850000 | 5.164950 | 24.000000 | 666.000000 | 20.200000 | 396.375000 | 16.695000 | 25.100000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 36.980000 | 50.000000 |
In [3]:
print(house.groupby('CHAS')[['MEDV']].describe().unstack('CHAS'))
CHAS
MEDV count 0.0 425.000000
1.0 30.000000
mean 0.0 22.134824
1.0 28.806667
std 0.0 8.762722
1.0 11.690165
min 0.0 5.000000
1.0 13.400000
25% 0.0 16.700000
1.0 21.700000
50% 0.0 21.000000
1.0 23.850000
75% 0.0 24.800000
1.0 33.175000
max 0.0 50.000000
1.0 50.000000
dtype: float64
In [4]:
print(house.RAD.value_counts())
print(house.groupby('RAD')[['MEDV']].describe().unstack('RAD'))
24.0 119
5.0 106
4.0 96
3.0 36
6.0 25
2.0 21
8.0 20
1.0 18
7.0 14
Name: RAD, dtype: int64
RAD
MEDV count 1.0 18.0
2.0 21.0
3.0 36.0
4.0 96.0
5.0 106.0
...
max 5.0 50.0
6.0 24.8
7.0 42.8
8.0 50.0
24.0 50.0
Length: 72, dtype: float64
In [5]:
# Correlation analysis (numeric_only=True restricts corr() to the numeric columns and avoids the FutureWarning)
house.corr(numeric_only=True)
Out[5]:
feature | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRIM | 1.000000 | -0.196246 | 0.404043 | -0.053530 | 0.410551 | -0.212673 | 0.349981 | -0.375011 | 0.614488 | 0.570890 | 0.287082 | -0.384054 | 0.444276 | -0.386511 |
ZN | -0.196246 | 1.000000 | -0.525693 | -0.025098 | -0.505469 | 0.317844 | -0.556758 | 0.647406 | -0.306968 | -0.316827 | -0.388047 | 0.181655 | -0.416454 | 0.376084 |
INDUS | 0.404043 | -0.525693 | 1.000000 | 0.057221 | 0.759181 | -0.391466 | 0.638347 | -0.702260 | 0.603900 | 0.730394 | 0.373639 | -0.369784 | 0.608342 | -0.483034 |
CHAS | -0.053530 | -0.025098 | 0.057221 | 1.000000 | 0.081192 | 0.121848 | 0.069445 | -0.089970 | -0.006660 | -0.033841 | -0.118460 | 0.043095 | -0.064032 | 0.181734 |
NOX | 0.410551 | -0.505469 | 0.759181 | 0.081192 | 1.000000 | -0.306085 | 0.730537 | -0.762721 | 0.609330 | 0.666784 | 0.175569 | -0.385323 | 0.597259 | -0.433293 |
RM | -0.212673 | 0.317844 | -0.391466 | 0.121848 | -0.306085 | 1.000000 | -0.245285 | 0.210719 | -0.214426 | -0.299118 | -0.354118 | 0.131836 | -0.611582 | 0.697426 |
AGE | 0.349981 | -0.556758 | 0.638347 | 0.069445 | 0.730537 | -0.245285 | 1.000000 | -0.741336 | 0.462316 | 0.517079 | 0.250436 | -0.281139 | 0.603303 | -0.384123 |
DIS | -0.375011 | 0.647406 | -0.702260 | -0.089970 | -0.762721 | 0.210719 | -0.741336 | 1.000000 | -0.496010 | -0.539764 | -0.227028 | 0.302359 | -0.498786 | 0.260402 |
RAD | 0.614488 | -0.306968 | 0.603900 | -0.006660 | 0.609330 | -0.214426 | 0.462316 | -0.496010 | 1.000000 | 0.908535 | 0.467822 | -0.454256 | 0.486773 | -0.383792 |
TAX | 0.570890 | -0.316827 | 0.730394 | -0.033841 | 0.666784 | -0.299118 | 0.517079 | -0.539764 | 0.908535 | 1.000000 | 0.461456 | -0.450327 | 0.547234 | -0.470998 |
PTRATIO | 0.287082 | -0.388047 | 0.373639 | -0.118460 | 0.175569 | -0.354118 | 0.250436 | -0.227028 | 0.467822 | 0.461456 | 1.000000 | -0.183874 | 0.362355 | -0.508638 |
B | -0.384054 | 0.181655 | -0.369784 | 0.043095 | -0.385323 | 0.131836 | -0.281139 | 0.302359 | -0.454256 | -0.450327 | -0.183874 | 1.000000 | -0.366402 | 0.332199 |
LSTAT | 0.444276 | -0.416454 | 0.608342 | -0.064032 | 0.597259 | -0.611582 | 0.603303 | -0.498786 | 0.486773 | 0.547234 | 0.362355 | -0.366402 | 1.000000 | -0.738124 |
MEDV | -0.386511 | 0.376084 | -0.483034 | 0.181734 | -0.433293 | 0.697426 | -0.384123 | 0.260402 | -0.383792 | -0.470998 | -0.508638 | 0.332199 | -0.738124 | 1.000000 |
In [15]:
import seaborn as sns
# Heatmap of the correlation matrix (numeric_only=True avoids the deprecation warning)
sns.heatmap(data=house.corr(numeric_only=True), cmap='Blues')
Out[15]:
<AxesSubplot: >
- MEDV vs. RM: strong positive correlation >> the more rooms per dwelling on average, the higher the house price
- MEDV vs. LSTAT: strong negative correlation >> the higher the proportion of lower-status residents, the lower the house price
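To make those relationships easier to scan, each feature's correlation with MEDV can be pulled out and sorted. A small sketch along those lines (this cell was not part of the original run):
# Correlation of every feature with the target, sorted from most negative to most positive
corr_with_medv = house.corr(numeric_only=True)['MEDV'].drop('MEDV').sort_values()
print(corr_with_medv)  # LSTAT should sit at the negative end, RM at the positive end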
In [6]:
# Regression analysis: OLS on all features except CHAS
my_formula = 'MEDV ~ CRIM + ZN + INDUS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT'
lm = ols(my_formula, data=house).fit()
print(lm.summary())
print('\nQuantities you can extract from the result: \n%s' % dir(lm))
print('\nCoefficients: \n%s' % lm.params)
print('\nCoefficient Std Errors: \n%s' % lm.bse)
print('\nAdj. R-squared: \n%.2f' % lm.rsquared_adj)
print('\nF-statistic: %.1f P-value: %.2f' % (lm.fvalue, lm.f_pvalue))
print('\nNumber of obs: %d Number of fitted values: %s' % (lm.nobs, len(lm.fittedvalues)))
OLS Regression Results
==============================================================================
Dep. Variable: MEDV R-squared: 0.738
Model: OLS Adj. R-squared: 0.731
Method: Least Squares F-statistic: 103.7
Date: Thu, 20 Apr 2023 Prob (F-statistic): 3.10e-120
Time: 19:04:03 Log-Likelihood: -1346.3
No. Observations: 455 AIC: 2719.
Df Residuals: 442 BIC: 2772.
Df Model: 12
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.6451 5.365 6.830 0.000 26.101 47.189
CRIM -0.1145 0.033 -3.477 0.001 -0.179 -0.050
ZN 0.0462 0.014 3.223 0.001 0.018 0.074
INDUS 0.0682 0.064 1.059 0.290 -0.058 0.195
NOX -17.5083 3.957 -4.425 0.000 -25.285 -9.732
RM 3.9141 0.438 8.944 0.000 3.054 4.774
AGE 0.0030 0.014 0.220 0.826 -0.024 0.030
DIS -1.3816 0.206 -6.693 0.000 -1.787 -0.976
RAD 0.3225 0.068 4.710 0.000 0.188 0.457
TAX -0.0131 0.004 -3.392 0.001 -0.021 -0.006
PTRATIO -1.0375 0.138 -7.508 0.000 -1.309 -0.766
B 0.0093 0.003 3.319 0.001 0.004 0.015
LSTAT -0.5278 0.053 -9.910 0.000 -0.632 -0.423
==============================================================================
Omnibus: 187.867 Durbin-Watson: 1.973
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1007.188
Skew: 1.724 Prob(JB): 1.96e-219
Kurtosis: 9.421 Cond. No. 1.50e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.5e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantities you can extract from the result:
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Coefficients:
Intercept 36.645120
CRIM -0.114515
ZN 0.046238
INDUS 0.068209
NOX -17.508327
RM 3.914144
AGE 0.003044
DIS -1.381636
RAD 0.322528
TAX -0.013146
PTRATIO -1.037455
B 0.009260
LSTAT -0.527760
dtype: float64
Coefficient Std Errors:
Intercept 5.364928
CRIM 0.032940
ZN 0.014348
INDUS 0.064428
NOX 3.956786
RM 0.437609
AGE 0.013845
DIS 0.206436
RAD 0.068470
TAX 0.003876
PTRATIO 0.138185
B 0.002790
LSTAT 0.053254
dtype: float64
Adj. R-squared:
0.73
F-statistic: 103.7 P-value: 0.00
Number of obs: 455 Number of fitted values: 455
- A one-unit increase in RM (average number of rooms per dwelling) raises MEDV (house price) by 3.914 on average
- A one-unit increase in LSTAT (proportion of lower-status residents) lowers MEDV by about 0.528 on average (MEDV is in $1,000s)
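The summary also warns about a large condition number, and the correlation table shows RAD and TAX at 0.91, so multicollinearity is worth a look. One way to quantify it is a variance inflation factor (VIF) check; a rough sketch with statsmodels, not part of the original notebook:
# VIFs for the same predictors used in the formula above
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS',
            'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X = sm.add_constant(house[features])  # include the intercept so the VIFs are interpretable
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop('const').sort_values(ascending=False))  # RAD/TAX tend to rank high here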
In [7]:
# Predict on the test set with the fitted model
test_data = pd.read_csv('./Boston Housing 데이터/test.csv')
predicted = lm.predict(test_data)
predicted_rounded = [round(score, 2) for score in predicted]  # rounded copy (not used for the submission)
print(predicted)
In [9]:
import pandas as pd
submit = pd.read_csv('./Boston Housing 데이터/sample_submission.csv')
In [11]:
submit['MEDV'] = predicted
submit.head()
Out[11]:
# | ID | MEDV |
---|---|---|
0 | TEST_000 | 22.724850 |
1 | TEST_001 | 32.277223 |
2 | TEST_002 | 19.828197 |
3 | TEST_003 | 24.575220 |
4 | TEST_004 | 26.172434 |
In [13]:
submit.to_csv('./submit.csv', index=False)
I saved the predictions to a CSV file and submitted it, which placed 36th on the leaderboard.
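As a possible next step, the model could also be validated locally before submitting. A rough sketch, assuming scikit-learn is available and that the leaderboard metric is RMSE (neither is stated in the original post):
# 5-fold cross-validated RMSE for the same feature set as the OLS formula (ID and CHAS excluded)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = house.drop(columns=['ID', 'CHAS', 'MEDV'])
y = house['MEDV']

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=5)
print('CV RMSE: %.3f' % np.sqrt(-neg_mse).mean())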
