Groupby and Pivot Tables

Contents

Introduction to Python - Groupby and Pivot Tables

introml.analyticsdojo.com

10. Groupby and Pivot Tables#

import numpy as np 
import pandas as pd 

# Input data files are available in the "../input/" directory.
# Let's input them into a Pandas DataFrame
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test  = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')

10.1. Groupby#

Often it is useful to see statistics by different classes.
Can be used to examine different subpopulations

train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

print(train.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

#What does this tell us?  
train.groupby(['Sex']).Survived.mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

#What does this tell us?  
train.groupby(['Sex','Pclass']).Survived.mean()

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

#What does this tell us?  Here it doesn't look so clear. We could separate by set age ranges.
train.groupby(['Sex','Age']).Survived.mean()

Sex     Age  
female  0.75     1.000000
        1.00     1.000000
        2.00     0.333333
        3.00     0.500000
        4.00     1.000000
                   ...   
male    70.00    0.000000
        70.50    0.000000
        71.00    0.000000
        74.00    0.000000
        80.00    1.000000
Name: Survived, Length: 145, dtype: float64

10.2. Combining Multiple Operations#

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

s = train.groupby(['Sex','Pclass'], as_index=False).Survived.sum()
s['PerSurv'] = train.groupby(['Sex','Pclass'], as_index=False).Survived.mean().Survived
s['PerSurv']=s['PerSurv']*100
s['Count'] = train.groupby(['Sex','Pclass'], as_index=False).Survived.count().Survived
survived =s.Survived
s

	Sex	Pclass	Survived	PerSurv	Count
0	female	1	91	96.808511	94
1	female	2	70	92.105263	76
2	female	3	72	50.000000	144
3	male	1	45	36.885246	122
4	male	2	17	15.740741	108
5	male	3	47	13.544669	347

#What does this tell us?  
spmean=train.groupby(['Sex','Pclass']).Survived.mean()
spcount=train.groupby(['Sex','Pclass']).Survived.sum()
spsum=train.groupby(['Sex','Pclass']).Survived.count()
spsum

Sex     Pclass
female  1          94
        2          76
        3         144
male    1         122
        2         108
        3         347
Name: Survived, dtype: int64

10.3. Pivot Tables#

A pivot table is a data summarization tool, much easier than the syntax of groupBy.
It can be used to that sum, sort, averge, count, over a pandas dataframe.
Download and open data in excel to appreciate the ways that you can use Pivot Tables.

#List the index and the functions you want to aggregage by. 
pd.pivot_table(train,index=["Sex","Pclass"],values=["Survived"],aggfunc=['count','sum','mean'])

		count	sum	mean
		Survived	Survived	Survived
Sex	Pclass
female	1	94	91	0.968085
	2	76	70	0.921053
	3	144	72	0.500000
male	1	122	45	0.368852
	2	108	17	0.157407
	3	347	47	0.135447