Introduction to Python - Groupby and Pivot Tables
introml.analyticsdojo.com
10. Groupby and Pivot Tables#
import numpy as np
import pandas as pd
# Input data files are available in the "../input/" directory.
# Let's input them into a Pandas DataFrame
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')
10.1. Groupby#
Often it is useful to see statistics by different classes.
Can be used to examine different subpopulations
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
print(train.dtypes)
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
#What does this tell us?
train.groupby(['Sex']).Survived.mean()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
#What does this tell us?
train.groupby(['Sex','Pclass']).Survived.mean()
Sex Pclass
female 1 0.968085
2 0.921053
3 0.500000
male 1 0.368852
2 0.157407
3 0.135447
Name: Survived, dtype: float64
#What does this tell us? Here it doesn't look so clear. We could separate by set age ranges.
train.groupby(['Sex','Age']).Survived.mean()
Sex Age
female 0.75 1.000000
1.00 1.000000
2.00 0.333333
3.00 0.500000
4.00 1.000000
...
male 70.00 0.000000
70.50 0.000000
71.00 0.000000
74.00 0.000000
80.00 1.000000
Name: Survived, Length: 145, dtype: float64
10.2. Combining Multiple Operations#
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
s = train.groupby(['Sex','Pclass'], as_index=False).Survived.sum()
s['PerSurv'] = train.groupby(['Sex','Pclass'], as_index=False).Survived.mean().Survived
s['PerSurv']=s['PerSurv']*100
s['Count'] = train.groupby(['Sex','Pclass'], as_index=False).Survived.count().Survived
survived =s.Survived
s
Sex | Pclass | Survived | PerSurv | Count | |
---|---|---|---|---|---|
0 | female | 1 | 91 | 96.808511 | 94 |
1 | female | 2 | 70 | 92.105263 | 76 |
2 | female | 3 | 72 | 50.000000 | 144 |
3 | male | 1 | 45 | 36.885246 | 122 |
4 | male | 2 | 17 | 15.740741 | 108 |
5 | male | 3 | 47 | 13.544669 | 347 |
#What does this tell us?
spmean=train.groupby(['Sex','Pclass']).Survived.mean()
spcount=train.groupby(['Sex','Pclass']).Survived.sum()
spsum=train.groupby(['Sex','Pclass']).Survived.count()
spsum
Sex Pclass
female 1 94
2 76
3 144
male 1 122
2 108
3 347
Name: Survived, dtype: int64
10.3. Pivot Tables#
A pivot table is a data summarization tool, much easier than the syntax of groupBy.
It can be used to that sum, sort, averge, count, over a pandas dataframe.
Download and open data in excel to appreciate the ways that you can use Pivot Tables.
#List the index and the functions you want to aggregage by.
pd.pivot_table(train,index=["Sex","Pclass"],values=["Survived"],aggfunc=['count','sum','mean'])
count | sum | mean | ||
---|---|---|---|---|
Survived | Survived | Survived | ||
Sex | Pclass | |||
female | 1 | 94 | 91 | 0.968085 |
2 | 76 | 70 | 0.921053 | |
3 | 144 | 72 | 0.500000 | |
male | 1 | 122 | 45 | 0.368852 |
2 | 108 | 17 | 0.157407 | |
3 | 347 | 47 | 0.135447 |