Titanic Survivor Prediction

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic_df = pd.read_csv('./titanic_train.csv')
titanic_df.head(3)

Out[1]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

In [2]:

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [3]:

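# Fill missing Age with the column mean; mark missing Cabin and Embarked as 'N'.
# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('N')
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('N')

print('Number of null values in the dataset:', titanic_df.isna().sum().sum())

Number of null values in the dataset: 0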
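As a minimal sketch of the same fill strategy, the block below uses a small made-up frame (the values are hypothetical, not rows from titanic_train.csv). It passes a dict to `DataFrame.fillna` so that each column gets its own replacement, and checks that the null count drops to zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the Titanic set.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})

nulls_before = int(toy.isna().sum().sum())

# A dict passed to DataFrame.fillna supplies one fill value per column.
filled = toy.fillna({'Age': toy['Age'].mean(), 'Cabin': 'N', 'Embarked': 'N'})

nulls_after = int(filled.isna().sum().sum())
print(nulls_before, nulls_after)
```

Filling Age with the mean keeps every row but pulls all imputed passengers to a single age, which is worth remembering when reading the age-group plot later on.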

In [4]:

print('Sex value distribution: \n', titanic_df['Sex'].value_counts())
print('Cabin value distribution: \n', titanic_df['Cabin'].value_counts())
print('Embarked value distribution: \n', titanic_df['Embarked'].value_counts())

Sex value distribution: 
 male      577
female    314
Name: Sex, dtype: int64
Cabin value distribution: 
 N              687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: Cabin, Length: 148, dtype: int64
Embarked value distribution: 
 S    644
C    168
Q     77
N      2
Name: Embarked, dtype: int64

In [5]:

# For Cabin, only the leading letter, which indicates the cabin class, seems to matter
titanic_df['Cabin'] = titanic_df['Cabin'].str[:1]
titanic_df['Cabin'].head(3)

Out[5]:

0    N
1    C
2    N
Name: Cabin, dtype: object
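A short sketch of what the `.str[:1]` slice does, using a few cabin strings that appear in the value counts above. The point is that many distinct cabin strings (148 in the output above) collapse into a handful of leading letters.

```python
import pandas as pd

# A few cabin strings taken from the value counts shown above.
cabins = pd.Series(['N', 'C85', 'C23 C25 C27', 'B96 B98', 'G6'])

# .str[:1] slices each string, keeping only its first character.
first_letter = cabins.str[:1]
print(first_letter.tolist())
print(first_letter.nunique())
```

Entries that list several cabins, such as 'C23 C25 C27', keep only the letter of the first one.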

Comparing survivor counts by sex

In [6]:

titanic_df.groupby(['Sex', 'Survived'])['Survived'].count()
# Survived is the target class label

Out[6]:

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
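Raw counts are hard to compare because the two groups differ in size (314 females, 577 males). As a rough sketch, the counts above can be turned into within-group survival rates like this; the Series is rebuilt by hand from the output, so the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    [81, 233, 468, 109],
    index=pd.MultiIndex.from_tuples(
        [('female', 0), ('female', 1), ('male', 0), ('male', 1)],
        names=['Sex', 'Survived'],
    ),
)

# Divide each count by the total for its Sex to get within-group shares.
rates = counts / counts.groupby(level='Sex').transform('sum')

# Keep only the Survived == 1 share, i.e. the survival rate per sex.
survival_rate = rates.xs(1, level='Survived').round(3)
print(survival_rate)
```

Because Survived is coded 0/1, `titanic_df.groupby('Sex')['Survived'].mean()` on the full frame should give the same rates. seaborn's `barplot` plots that mean by default.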

In [7]:

sns.barplot(titanic_df, x='Sex', y='Survived')

Out[7]:

<AxesSubplot: xlabel='Sex', ylabel='Survived'>

Comparing survivor counts by passenger class

In [8]:

titanic_df.groupby(['Pclass', 'Survived'])['Survived'].count()

Out[8]:

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

In [9]:

sns.barplot(titanic_df, x='Pclass', y='Survived', hue='Sex')

Out[9]:

<AxesSubplot: xlabel='Pclass', ylabel='Survived'>

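For females, the survival rate differs little between first and second class, but drops considerably in third class.

For males, the survival rate in first class is markedly higher than in second and third class.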
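The bar heights in a plot with `hue='Sex'` are per-cell means, so the same comparison can be read from a table. A minimal sketch with `pivot_table`, on hypothetical rows rather than the real data (only the column names match):

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the Titanic set.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column per (Pclass, Sex) cell is that cell's survival rate.
rate_table = toy.pivot_table(
    index='Pclass', columns='Sex', values='Survived', aggfunc='mean'
)
print(rate_table)
```

Running the same call on `titanic_df` would give the numbers behind the plot above, without the confidence intervals that seaborn draws.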

In [10]:

# Use apply with a lambda

def get_category(age):
    cat = ''
    if age <= -1: cat = 'unknown'
    elif age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else: cat = 'Elderly'

    return cat

# Make the figure larger
plt.figure(figsize=(10,6))

# Fix the order in which the categories appear on the x-axis
group_names = ['unknown', 'Baby', 'Child', 'Teenager', 'Student',  'Young Adult', 'Adult', 'Elderly']

# Store the value returned by get_category for each Age in a temporary Age_cat column
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
sns.barplot(titanic_df, x='Age_cat', y='Survived', hue='Sex', order = group_names)
# Age_cat was only needed for the plot, so drop it again
titanic_df.drop('Age_cat', axis=1, inplace=True)

For females, the Child group has a comparatively low survival rate, while the Elderly group's is very high.

For males, the Baby group has the highest survival rate, while Teenager through Elderly show uniformly low rates.
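As an alternative to the if/elif chain, the same binning can be sketched with `pd.cut`. This is not what the post uses; it is shown only for comparison. The bin edges below are derived from the thresholds in `get_category`, and the ages are hypothetical test values chosen to hit the boundaries.

```python
import numpy as np
import pandas as pd

def get_category(age):
    # Same thresholds as the function in the cell above.
    if age <= -1: return 'unknown'
    elif age <= 5: return 'Baby'
    elif age <= 12: return 'Child'
    elif age <= 18: return 'Teenager'
    elif age <= 25: return 'Student'
    elif age <= 35: return 'Young Adult'
    elif age <= 60: return 'Adult'
    else: return 'Elderly'

group_names = ['unknown', 'Baby', 'Child', 'Teenager', 'Student',
               'Young Adult', 'Adult', 'Elderly']

# Right-closed bins: (-inf, -1], (-1, 5], (5, 12], ..., (60, inf)
bins = [-np.inf, -1, 5, 12, 18, 25, 35, 60, np.inf]

# Hypothetical ages, including values that sit exactly on a bin edge.
ages = pd.Series([-1, 0.42, 5, 5.5, 12, 18, 22, 25, 35, 60, 80])

by_cut = pd.cut(ages, bins=bins, labels=group_names)
by_apply = ages.apply(get_category)

# Both approaches should label every age identically.
print((by_cut.astype(str) == by_apply).all())
```

One difference: `pd.cut` leaves a NaN age as missing, whereas the if/elif chain sends NaN to 'Elderly' because every comparison with NaN is False. Age was already filled with the mean here, so this case does not arise in the notebook.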

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	7	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	2	0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	7	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	2	3
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	7	3
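The table above shows Sex, Cabin, and Embarked as integers, but the cell that produced it is not included in the text. The values are consistent with replacing each label by its position among the sorted labels of that column, which is how scikit-learn's `LabelEncoder` behaves. That is an inference, not something the post states.

A minimal sketch of that idea using pandas categorical codes follows. The full Cabin letter set is an assumption, since the outputs above show only some of the letters.

```python
import pandas as pd

# Label sets: Sex and Embarked come from the value counts in the post;
# the full Cabin letter set is an assumption (the post shows only part of it).
label_sets = {
    'Sex': ['female', 'male'],
    'Cabin': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'N', 'T'],
    'Embarked': ['C', 'N', 'Q', 'S'],
}

# First three rows of the cleaned frame, as shown earlier in the post.
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Cabin': ['N', 'C', 'N'],
    'Embarked': ['S', 'C', 'S'],
})

encoded = sample.copy()
for col, labels in label_sets.items():
    # Each label is replaced by its position in the sorted label list.
    encoded[col] = pd.Categorical(sample[col], categories=sorted(labels)).codes

print(encoded)
```

Integer codes like these impose an arbitrary ordering on the labels. Tree-based models tolerate that, but for linear models one-hot encoding is usually the safer choice.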
