정렬, Aggregation 함수, GroupBy 적용¶
sort_values( )¶
- 주요 파라미터: by='칼럼명', ascending=True(기본 오름차순), inplace=False(호출한 df는 그대로 유지하면서 정렬된 df를 결과로 반환)
In [1]:
import pandas as pd
titanic_df = pd.read_csv('./titanic_train.csv')
titanic_df
Out[1]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
In [13]:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [7]:
titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)
Out[7]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
845 | 846 | 0 | 3 | Abbing, Mr. Anthony | male | 42.0 | 0 | 0 | C.A. 5547 | 7.55 | NaN | S |
746 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |
279 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |
In [9]:
titanic_sorted = titanic_df.sort_values(by=['Name', 'Fare'], ascending=False)
titanic_sorted.head(3)
Out[9]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
153 | 154 | 0 | 3 | van Billiard, Mr. Austin Blyler | male | 40.5 | 0 | 2 | A/5. 851 | 14.5000 | NaN | S |
361 | 362 | 0 | 2 | del Carlo, Mr. Sebastiano | male | 29.0 | 1 | 0 | SC/PARIS 2167 | 27.7208 | NaN | C |
Aggregation 함수 적용¶
- min, max, sum, mean, count와 같은 함수
- df에 count()를 적용하면 모든 칼럼에 count() 결과를 반환, 대신 count는 Null값을 반영하지 않음
- 때문에 Null 값이 있는 칼럼은 count() 결과값이 다름
In [14]:
titanic_df.count()
Out[14]:
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64
In [15]:
titanic_df[['Age', 'Fare']].mean()
Out[15]:
Age 29.699118
Fare 32.204208
dtype: float64
In [20]:
titanic_groupby = titanic_df.groupby(by='Pclass')
titanic_groupby
Out[20]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F88543DC60>
In [23]:
#반환된 결과에 aggregation 함수 적용하면 대상 칼럼을 제외한 모든 칼럼에 aggregation 함수가 적용됌
#특정 칼럼에 대한 min, max, count 등을 알고싶으면 groupby로 필터링 후 함수 적용
titanic_groupby = titanic_df.groupby(by='Pclass').count()
titanic_groupby
Out[23]:
PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Pclass | |||||||||||
1 | 216 | 216 | 216 | 216 | 186 | 216 | 216 | 216 | 216 | 176 | 214 |
2 | 184 | 184 | 184 | 184 | 173 | 184 | 184 | 184 | 184 | 16 | 184 |
3 | 491 | 491 | 491 | 491 | 355 | 491 | 491 | 491 | 491 | 12 | 491 |
In [24]:
titanic_groupby = titanic_df.groupby(by='Pclass')['PassengerId', 'Survived'].count()
titanic_groupby
C:\Users\Administrator\AppData\Local\Temp\ipykernel_10184\3732055399.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
titanic_groupby = titanic_df.groupby(by='Pclass')['PassengerId', 'Survived'].count()
Out[24]:
PassengerId | Survived | |
---|---|---|
Pclass | ||
1 | 216 | 216 |
2 | 184 | 184 |
3 | 491 | 491 |
In [26]:
titanic_df.groupby('Pclass')['Age'].agg([min, max])
Out[26]:
min | max | |
---|---|---|
Pclass | ||
1 | 0.92 | 80.0 |
2 | 0.67 | 70.0 |
3 | 0.42 | 74.0 |
In [27]:
#agg( )내에 딕셔너리 형태로 aggregation이 적용될 칼럼들과 함수 입력
agg_format = {'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)
Out[27]:
Age | SibSp | Fare | |
---|---|---|---|
Pclass | |||
1 | 80.0 | 90 | 84.154687 |
2 | 70.0 | 74 | 20.662183 |
3 | 74.0 | 302 | 13.675550 |
'Python > Pandas' 카테고리의 다른 글
판다스 pandas (7) - lambda, apply, map (0) | 2023.04.30 |
---|---|
판다스 pandas (6) - isna( ), fillna( ) (0) | 2023.04.30 |
판다스 pandas (4) - indexing, iloc, loc, boolean (0) | 2023.04.30 |
판다스 pandas (3) - Index 객체, reset_index (0) | 2023.04.30 |
판다스 pandas (2) - ndarray, list, dict <-> df 변환 (0) | 2023.04.25 |