AnalyticsDojo

Introduction to Python - Introduction to Apply Function

rpi.analyticsdojo.com

8. Introduction to Apply Function

  • Don’t write explicit loops over the rows of a dataframe.

  • Instead, use the apply function to call a function on each value.
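To make the contrast concrete, here is a minimal sketch comparing an explicit loop with apply. The toy dataframe and the "add one year" operation are assumptions for illustration only and are not part of the Titanic data used below.

```python
import pandas as pd

# Hypothetical toy data standing in for an 'Age' column.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

# Explicit loop: works, but is verbose and tends to be slow on large frames.
looped = []
for _, row in toy.iterrows():
    looped.append(row["Age"] + 1)

# Same result with apply: the function is called once per value.
applied = toy["Age"].apply(lambda age: age + 1)

print(applied.tolist())  # [23.0, 39.0, 27.0]
```

Both approaches produce the same values; apply simply keeps the per-value logic in one place.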

import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
df
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns

8.1. Make it easy with the lambda function.

  • Create a new column holding the square of Age.

df['age-squared']=df['Age'].apply(lambda x: x**2)
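For simple arithmetic like squaring, pandas can also operate on the whole column at once. The sketch below, which uses a small made-up Series rather than the real file, suggests that the two forms give the same result, including for missing values.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration), including one missing age.
ages = pd.Series([22.0, np.nan, 35.0], name="Age")

squared_apply = ages.apply(lambda x: x ** 2)  # one call per value
squared_vector = ages ** 2                    # whole column at once

# equals() treats NaN values in matching positions as equal.
print(squared_apply.equals(squared_vector))  # True
```

The apply form becomes more useful once the per-value logic is too involved for a single arithmetic expression.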

8.2. Or define an entire function.

  • Define a function to get the title from the name.

  • Always test your function on a single entry before passing it to apply.

def get_title(x):
  x = str(x)
  x = x.split(',')   # Split at the comma: surname first, then the rest
  x = x[1].strip()   # Keep the part after the comma and remove leading spaces
  x = x.split(' ')   # Split at the spaces
  return x[0]        # The first word is the title

# Always test your function on a single entry before using apply.
get_title('Dooley, Mr. Patrick')
'Mr.'
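Testing on single entries is also a good way to probe edge cases. As written, get_title assumes every name contains a comma; a value without one would make x[1] raise an IndexError. The variant below is a hypothetical defensive sketch (get_title_safe is not part of the original notebook) that returns None instead.

```python
def get_title_safe(name):
    """Return the first word after the first comma, or None if unavailable."""
    parts = str(name).split(",")
    if len(parts) < 2:
        return None  # no comma, so there is nothing to extract
    words = parts[1].split()
    return words[0] if words else None

print(get_title_safe("Dooley, Mr. Patrick"))  # Mr.
print(get_title_safe("Dooley"))               # None
print(get_title_safe(float("nan")))           # None
```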
df['Title']=df['Name'].apply(get_title)
df[['Name','Title']]
     Name                                               Title
0    Braund, Mr. Owen Harris                            Mr.
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  Mrs.
2    Heikkinen, Miss. Laina                             Miss.
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)       Mrs.
4    Allen, Mr. William Henry                           Mr.
...  ...                                                ...
886  Montvila, Rev. Juozas                              Rev.
887  Graham, Miss. Margaret Edith                       Miss.
888  Johnston, Miss. Catherine Helen "Carrie"           Miss.
889  Behr, Mr. Karl Howell                              Mr.
890  Dooley, Mr. Patrick                                Mr.

891 rows × 2 columns

df['Title'].unique()
array(['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.',
       'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.', 'the',
       'Jonkheer.'], dtype=object)
df['Title'].value_counts()
Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Col.           2
Mlle.          2
Lady.          1
Capt.          1
Jonkheer.      1
Don.           1
Ms.            1
the            1
Mme.           1
Sir.           1
Name: Title, dtype: int64
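The single 'the' in the counts suggests that at least one name carries a multi-word title, so taking only the first word after the comma cuts it short. One possible alternative, sketched here on made-up names, is to capture everything between the comma and the first period with the pandas string method str.extract. Note that this variant drops the trailing period, so its labels would not match the ones above exactly.

```python
import pandas as pd

names = pd.Series([
    "Dooley, Mr. Patrick",
    "Example, the Countess. of Somewhere",  # hypothetical multi-word title
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())  # ['Mr', 'the Countess']
```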

8.3. Pass Additional Values

apply forwards any extra keyword arguments to your function, so the function can take additional values such as lists of titles to recode.

RECODE_MRS = ['Lady.', 'Mme.']
RECODE_MISS = ['Ms.']
RECODE_MR = ['Sir.', 'the', 'Don.', 'Jonkheer.', 'Capt.']

def get_title2(x, recode_mrs, recode_miss, recode_mr):
  x = str(x)
  x = x.split(',')   # Split at the comma
  x = x[1].strip()   # Remove leading spaces
  x = x.split(' ')   # Split at the spaces
  x = x[0]           # Select the first word
  # Collapse rare titles into the common ones
  if x in recode_mrs:
    x = 'Mrs.'
  elif x in recode_miss:
    x = 'Miss.'
  elif x in recode_mr:
    x = 'Mr.'
  return x

# Always test your function on a single entry before using apply.
get_title2('Dooley, Sir. Patrick', recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)
'Mr.'
df['Title'] = df['Name'].apply(get_title2, recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)
df[['Name','Title']]
     Name                                               Title
0    Braund, Mr. Owen Harris                            Mr.
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  Mrs.
2    Heikkinen, Miss. Laina                             Miss.
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)       Mrs.
4    Allen, Mr. William Henry                           Mr.
...  ...                                                ...
886  Montvila, Rev. Juozas                              Rev.
887  Graham, Miss. Margaret Edith                       Miss.
888  Johnston, Miss. Catherine Helen "Carrie"           Miss.
889  Behr, Mr. Karl Howell                              Mr.
890  Dooley, Mr. Patrick                                Mr.

891 rows × 2 columns
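Extra values can reach the function either as keyword arguments, as in the call above, or positionally through the args parameter of Series.apply. The function and data below are hypothetical and only illustrate the two calling styles.

```python
import pandas as pd

def add_suffix(value, suffix, sep="-"):
    """Hypothetical helper: join a value and a suffix with a separator."""
    return f"{value}{sep}{suffix}"

s = pd.Series(["a", "b"])

# Extra keyword arguments are forwarded to the function.
by_keyword = s.apply(add_suffix, suffix="x", sep="_")

# Extra positional arguments go in the args tuple.
by_position = s.apply(add_suffix, args=("x",), sep="_")

print(by_keyword.tolist())   # ['a_x', 'b_x']
print(by_position.tolist())  # ['a_x', 'b_x']
```

Keyword arguments tend to read more clearly when a function takes several extra values, which is probably why the notebook uses them.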

df['Title'].value_counts()
Mr.        522
Miss.      183
Mrs.       127
Master.     40
Dr.          7
Rev.         6
Mlle.        2
Major.       2
Col.         2
Name: Title, dtype: int64
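When the recoding is a plain lookup, a dictionary passed to Series.replace may be a simpler alternative to a custom function. This is a sketch on a toy Series; the mapping mirrors part of the recode lists above but is not the notebook's own approach.

```python
import pandas as pd

titles = pd.Series(["Mr.", "Lady.", "Ms.", "Sir.", "Dr."])  # toy data

# Map rare titles onto common ones; unlisted values are left unchanged.
recode = {"Lady.": "Mrs.", "Mme.": "Mrs.", "Ms.": "Miss.", "Sir.": "Mr."}
recoded = titles.replace(recode)

print(recoded.tolist())  # ['Mr.', 'Mrs.', 'Miss.', 'Mr.', 'Dr.']
```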

8.4. Using Values from more than one column

  • Apply a function across the entire dataframe (with axis=1, so it receives one row at a time) when the calculation involves more than one column.

def complex_process(row):
  
  return row['Sex']+str(row['Age'])

df.apply(complex_process, axis = 1)
0        male22.0
1      female38.0
2      female26.0
3      female35.0
4        male35.0
          ...    
886      male27.0
887    female19.0
888     femalenan
889      male26.0
890      male32.0
Length: 891, dtype: object
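For a simple combination like this one, the same result can often be built directly from the columns without a row-wise apply. The sketch below uses made-up rows with no missing ages; how missing values are rendered as text may differ between the two forms and across pandas versions, so that case is worth checking separately.

```python
import pandas as pd

# Toy rows (an assumption for illustration), without missing values.
toy = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})

# Row-wise: the function receives one row at a time.
row_wise = toy.apply(lambda row: row["Sex"] + str(row["Age"]), axis=1)

# Column-wise: convert Age to text once, then concatenate the columns.
column_wise = toy["Sex"] + toy["Age"].astype(str)

print(row_wise.tolist())     # ['male22.0', 'female38.0']
print(column_wise.tolist())  # ['male22.0', 'female38.0']
```

Row-wise apply remains the more flexible option when the logic involves branching on several columns.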