Introduction to Apply Function

rpi.analyticsdojo.com

8. Introduction to Apply Function

Don’t loop over a dataframe.
Instead, use the apply function to run a function on each value.
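To make the contrast concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` frame and the "age next year" calculation are illustrative assumptions, not part of the Titanic data). Both approaches give the same values; apply simply hands the per-value work to pandas.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
toy = pd.DataFrame({"Age": [22, 38, 26]})

# Explicit loop: build the result one row at a time.
looped = []
for _, row in toy.iterrows():
    looped.append(row["Age"] + 1)

# apply: pass the same operation to pandas, which calls it on each value.
applied = toy["Age"].apply(lambda x: x + 1)

print(looped)
print(applied.tolist())
```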

import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

8.1. Make it easy with the lambda function.

Create a new column holding Age squared.

df['age-squared']=df['Age'].apply(lambda x: x**2)
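Age contains missing values, so it helps to check how the lambda treats them. The sketch below uses a small hypothetical Series rather than the full dataset. It also shows that, for simple arithmetic like this, operating on the whole column at once gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 38.0, np.nan], name="Age")

# apply calls the lambda once per value; NaN ** 2 stays NaN.
squared_apply = ages.apply(lambda x: x ** 2)

# The same calculation written as a whole-column operation.
squared_vector = ages ** 2

print(squared_apply.tolist())
print(squared_apply.equals(squared_vector))
```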

8.2. Or define an entire function.

Define a function to get the title from the name.
Always test your function on a single entry before passing it to apply.

def get_title(x):
  x = str(x)
  x = x.split(',')  # Split at the comma: surname first, the rest second
  x = x[1].strip()  # Take the part after the comma and trim surrounding spaces
  x = x.split(' ')  # Split at the spaces
  return x[0]       # The first word is the title

# Always test your function on a single entry before using apply.
get_title('Dooley, Mr. Patrick')

'Mr.'
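Testing on single entries is also a quick way to find inputs the function cannot handle. The sketch below repeats get_title and tries it on a name without a comma, which raises an IndexError. The `get_title_safe` variant and its `default` parameter are hypothetical additions for illustration only.

```python
def get_title(x):
    x = str(x)
    x = x.split(',')
    x = x[1].strip()
    x = x.split(' ')
    return x[0]

def get_title_safe(x, default=None):
    """Hypothetical variant: return `default` when the name has no comma."""
    parts = str(x).split(',')
    if len(parts) < 2:
        return default
    return parts[1].strip().split(' ')[0]

# 'Madonna' is a made-up entry with no comma.
for name in ['Dooley, Mr. Patrick', 'Heikkinen, Miss. Laina', 'Madonna']:
    try:
        print(name, '->', get_title(name))
    except IndexError:
        print(name, '-> IndexError; safe version returns', get_title_safe(name))
```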

df['Title']=df['Name'].apply(get_title)
df[['Name','Title']]

	Name	Title
0	Braund, Mr. Owen Harris	Mr.
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	Mrs.
2	Heikkinen, Miss. Laina	Miss.
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	Mrs.
4	Allen, Mr. William Henry	Mr.
...	...	...
886	Montvila, Rev. Juozas	Rev.
887	Graham, Miss. Margaret Edith	Miss.
888	Johnston, Miss. Catherine Helen "Carrie"	Miss.
889	Behr, Mr. Karl Howell	Mr.
890	Dooley, Mr. Patrick	Mr.

891 rows × 2 columns

df['Title'].unique()

array(['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.',
       'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.', 'the',
       'Jonkheer.'], dtype=object)

df['Title'].value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Major.         2
Col.           2
Mlle.          2
Lady.          1
Capt.          1
Jonkheer.      1
Don.           1
Ms.            1
the            1
Mme.           1
Sir.           1
Name: Title, dtype: int64
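The lone 'the' in the counts appears because get_title keeps only the first word after the comma, so a multi-word title is cut short. As one possible alternative (a swapped-in technique, not the approach used above), the pandas string method str.extract can capture everything up to the first period. The second name below is a hypothetical example of such a title.

```python
import pandas as pd

names = pd.Series([
    'Dooley, Mr. Patrick',
    'Example, the Countess. of (Jane Doe)',  # hypothetical multi-word title
])

# Capture the text between the comma and the first period, period included.
titles = names.str.extract(r',\s*([^.]+\.)', expand=False)

print(titles.tolist())
```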

8.3. Pass Additional Values

Apply can also forward additional values to your function as keyword arguments.

RECODE_MRS=['Lady.','Mme.']
RECODE_MISS=['Ms.']
RECODE_MR=['Sir.','the','Don.','Jonkheer.','Capt.']
def get_title2(x,recode_mrs, recode_miss, recode_mr):
  
  x = str(x)
  x = x.split(',') #Split at the comma
  x = x[1].strip() #remove any leading spaces
  x = x.split(' ')#Split at the spaces
  x = x[0] #select the first word. 
  if x in recode_mrs:
    x='Mrs.'
  elif x in recode_miss:
    x='Miss.'
  elif x in recode_mr:
    x='Mr.'
  return x

# Always test your function on a single entry before using apply.
get_title2('Dooley, Sir. Patrick', recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)

'Mr.'
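Extra values can reach the function in two ways: as keyword arguments, or positionally through the `args` tuple of Series.apply. The sketch below uses a hypothetical helper named `recode` and a made-up list of titles to show that both forms give the same result.

```python
import pandas as pd

def recode(value, old_values, new_value):
    """Hypothetical helper: swap value for new_value if it is in old_values."""
    return new_value if value in old_values else value

# Made-up titles for illustration.
titles = pd.Series(['Mr.', 'Sir.', 'Don.', 'Miss.'])

# Extra values passed by name.
by_keyword = titles.apply(recode, old_values=['Sir.', 'Don.'], new_value='Mr.')

# The same extra values passed by position through args.
by_position = titles.apply(recode, args=(['Sir.', 'Don.'], 'Mr.'))

print(by_keyword.tolist())
print(by_position.tolist())
```

Keyword arguments tend to be easier to read when a function takes several extra values, since each one is labeled at the call site.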

df['Title']=df['Name'].apply(get_title2, recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)
df[['Name','Title']]

	Name	Title
0	Braund, Mr. Owen Harris	Mr.
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	Mrs.
2	Heikkinen, Miss. Laina	Miss.
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	Mrs.
4	Allen, Mr. William Henry	Mr.
...	...	...
886	Montvila, Rev. Juozas	Rev.
887	Graham, Miss. Margaret Edith	Miss.
888	Johnston, Miss. Catherine Helen "Carrie"	Miss.
889	Behr, Mr. Karl Howell	Mr.
890	Dooley, Mr. Patrick	Mr.

891 rows × 2 columns

df['Title'].value_counts()

Mr.        522
Miss.      183
Mrs.       127
Master.     40
Dr.          7
Rev.         6
Mlle.        2
Major.       2
Col.         2
Name: Title, dtype: int64

8.4. Using Values from more than one column

Apply a function to the entire dataframe, row by row, if the calculation involves more than one column.

def complex_process(row):
  
  return row['Sex']+str(row['Age'])

df.apply(complex_process, axis = 1)

0        male22.0
1      female38.0
2      female26.0
3      female35.0
4        male35.0
          ...
886      male27.0
887    female19.0
888     femalenan
889      male26.0
890      male32.0
Length: 891, dtype: object
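With axis = 1 the function receives one row at a time, so it can read any column by name. The sketch below applies the same idea to two numeric columns on a small hypothetical frame; the `family_size` function and the values are assumptions for illustration. It also shows the equivalent whole-column arithmetic, which gives the same numbers here.

```python
import pandas as pd

# Hypothetical rows using two of the Titanic column names.
toy = pd.DataFrame({'SibSp': [1, 0, 1], 'Parch': [0, 0, 2]})

def family_size(row):
    # row is a Series holding one row; index it by column name.
    return row['SibSp'] + row['Parch'] + 1

# Row-wise apply: the function is called once per row.
by_row = toy.apply(family_size, axis=1)

# The same calculation written directly on the columns.
by_column = toy['SibSp'] + toy['Parch'] + 1

print(by_row.tolist())
print(by_column.tolist())
```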