8. Introduction to Apply Function
Don’t loop over a dataframe.
Instead, use the apply function to run a function on each value.
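As a rough illustration of the advice above, the sketch below runs the same function once with an explicit loop and once with apply. The `toy` frame, its `fare` column, and the `round_fare` helper are made up for this example and are not part of the Titanic data loaded below.

```python
import pandas as pd

# Small made-up frame standing in for real data (hypothetical values).
toy = pd.DataFrame({"fare": [7.25, 71.2833, 7.925]})


def round_fare(value):
    """Round a single fare to the nearest whole number."""
    return round(value)


# Loop version: works, but it is verbose and tends to be slow on large frames.
looped = []
for _, row in toy.iterrows():
    looped.append(round_fare(row["fare"]))

# apply version: pandas calls round_fare once per value in the column.
applied = toy["fare"].apply(round_fare)

# Both approaches produce the same values.
print(applied.tolist() == looped)
```

The apply version is shorter and keeps the result as a Series, so it can be assigned straight back to a new column.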
import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
8.1. Make it easy with the lambda function.
Create a value for Age-squared.
df['age-squared']=df['Age'].apply(lambda x: x**2)
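For simple arithmetic like this, pandas can also operate on the whole column at once. The sketch below, using a small made-up `ages` Series rather than the real data, checks that the lambda version and the column-wise version give the same result, including for missing values.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
ages = pd.Series([22.0, 38.0, np.nan], name="Age")

# Element by element, as in the lambda example above.
squared_apply = ages.apply(lambda x: x ** 2)

# Column-wise arithmetic: no explicit function call per value.
squared_vector = ages ** 2

# Series.equals treats NaN values in the same position as equal.
print(squared_apply.equals(squared_vector))
```

The lambda form is still the more general tool, since it accepts any Python expression, not only arithmetic that pandas supports directly.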
8.2. Or define an entire function.
Define a function to get the title from the name.
Always test your function on a single entry before using it with apply.
def get_title(x):
    x = str(x)
    x = x.split(',')  # Split at the comma
    x = x[1].strip()  # Remove any leading spaces
    x = x.split(' ')  # Split at the spaces
    return x[0]
#Always test your function with a single entry, not the apply.
get_title('Dooley, Mr. Patrick')
'Mr.'
df['Title']=df['Name'].apply(get_title)
df[['Name','Title']]
| | Name | Title |
---|---|---|
0 | Braund, Mr. Owen Harris | Mr. |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs. |
2 | Heikkinen, Miss. Laina | Miss. |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs. |
4 | Allen, Mr. William Henry | Mr. |
... | ... | ... |
886 | Montvila, Rev. Juozas | Rev. |
887 | Graham, Miss. Margaret Edith | Miss. |
888 | Johnston, Miss. Catherine Helen "Carrie" | Miss. |
889 | Behr, Mr. Karl Howell | Mr. |
890 | Dooley, Mr. Patrick | Mr. |
891 rows × 2 columns
df['Title'].unique()
array(['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.',
'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.', 'the',
'Jonkheer.'], dtype=object)
df['Title'].value_counts()
Mr. 517
Miss. 182
Mrs. 125
Master. 40
Dr. 7
Rev. 6
Major. 2
Col. 2
Mlle. 2
Lady. 1
Capt. 1
Jonkheer. 1
Don. 1
Ms. 1
the 1
Mme. 1
Sir. 1
Name: Title, dtype: int64
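The 'the' entry in the counts above suggests a title that spans more than one word, which get_title cuts short because it keeps only the first word after the comma. One possible alternative, sketched here with a regular expression rather than the split-based approach used above, keeps everything between the comma and the first period. The function name `get_title_regex` and the second test name are made up for illustration.

```python
import re


def get_title_regex(name):
    """Return the text between the comma and the first period (period included)."""
    match = re.search(r",\s*([^.]+\.)", str(name))
    # Return None when the name does not follow the "Last, Title. First" pattern.
    return match.group(1) if match else None


# Test with single entries first, as recommended above.
print(get_title_regex("Dooley, Mr. Patrick"))
print(get_title_regex("Doe, the Countess. of (Jane)"))  # hypothetical name
```

Returning None for names that do not match also avoids the IndexError that x[1] would raise on a name without a comma.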
8.3. Pass Additional Values
You can also pass additional values to your function: apply forwards any extra keyword arguments to it.
RECODE_MRS=['Lady.','Mme.']
RECODE_MISS=['Ms.']
RECODE_MR=['Sir.','the','Don.','Jonkheer.','Capt.']
def get_title2(x, recode_mrs, recode_miss, recode_mr):
    x = str(x)
    x = x.split(',')  # Split at the comma
    x = x[1].strip()  # Remove any leading spaces
    x = x.split(' ')  # Split at the spaces
    x = x[0]  # Select the first word
    if x in recode_mrs:
        x = 'Mrs.'
    elif x in recode_miss:
        x = 'Miss.'
    elif x in recode_mr:
        x = 'Mr.'
    return x
#Always test your function with a single entry, not the apply.
get_title2('Dooley, Sir. Patrick', recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)
'Mr.'
df['Title']=df['Name'].apply(get_title2,recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR )
df[['Name','Title']]
| | Name | Title |
---|---|---|
0 | Braund, Mr. Owen Harris | Mr. |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs. |
2 | Heikkinen, Miss. Laina | Miss. |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs. |
4 | Allen, Mr. William Henry | Mr. |
... | ... | ... |
886 | Montvila, Rev. Juozas | Rev. |
887 | Graham, Miss. Margaret Edith | Miss. |
888 | Johnston, Miss. Catherine Helen "Carrie" | Miss. |
889 | Behr, Mr. Karl Howell | Mr. |
890 | Dooley, Mr. Patrick | Mr. |
891 rows × 2 columns
df['Title'].value_counts()
Mr. 521
Miss. 183
Mrs. 127
Master. 40
Dr. 7
Rev. 6
Mlle. 2
Major. 2
Col. 2
Capt. 1
Name: Title, dtype: int64
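Extra values can reach the function either as keyword arguments, as in the call above, or positionally through the `args` parameter of `Series.apply`. The sketch below shows both forms on a made-up `titles` Series; the `recode_title` helper is hypothetical and much simpler than get_title2.

```python
import pandas as pd

# Made-up titles for illustration.
titles = pd.Series(["Mr.", "Sir.", "Mme.", "Dr."])


def recode_title(title, old_titles, new_title):
    """Return new_title when title is in old_titles, otherwise title unchanged."""
    return new_title if title in old_titles else title


# Keyword form: extra keyword arguments are forwarded to recode_title.
by_keyword = titles.apply(recode_title, old_titles=["Sir."], new_title="Mr.")

# Positional form: args is a tuple passed after the value itself.
by_position = titles.apply(recode_title, args=(["Sir."], "Mr."))

print(by_keyword.tolist())
print(by_position.tolist())
```

The keyword form is usually easier to read, because each extra value is labeled at the call site.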
8.4. Using Values from more than one column
Apply a function to the entire dataframe, row by row with axis=1, if the calculation involves more than one column.
def complex_process(row):
    return row['Sex'] + str(row['Age'])
df.apply(complex_process, axis = 1)
0 male22.0
1 female38.0
2 female26.0
3 female35.0
4 male35.0
...
886 male27.0
887 female19.0
888 femalenan
889 male26.0
890 male32.0
Length: 891, dtype: object
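With axis=1, the function receives one row at a time and can read any column by name. The sketch below repeats the idea on a small made-up `people` frame and compares it with a column-wise expression that gives the same strings for these values; whether the column-wise form fits depends on how complex the per-row logic is.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
people = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})


def complex_process(row):
    """Combine two columns of a single row into one string."""
    return row["Sex"] + str(row["Age"])


# Row-wise: complex_process is called once per row.
row_wise = people.apply(complex_process, axis=1)

# Column-wise: convert Age to strings, then concatenate the two columns.
column_wise = people["Sex"] + people["Age"].astype(str)

print(row_wise.tolist())
print(row_wise.tolist() == column_wise.tolist())
```

Row-wise apply is the more flexible option when the logic needs branching or several columns at once.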