28. Titanic Regression

Here we are going to build a regression model to predict the missing values of the Age variable.

import pandas as pd

#Load the Titanic train and test sets from the course repository.
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')

print(train.columns, test.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Here is a broad description of the keys and what they mean:

pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
survival        Survival
                (0 = No; 1 = Yes)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
boat            Lifeboat
body            Body Identification Number
home.dest       Home/Destination

In general, it looks like name, sex, cabin, embarked, boat, body, and home.dest may be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:

train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
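Beyond eyeballing the first rows, a quick dtype and missing-value check (a suggested addition, not part of the original notebook) makes the categorical/numerical split and the gaps in Age explicit:

#Object columns are categorical candidates; the rest are numeric.
print(train.dtypes)
#Count missing values per column; Age, Cabin, and Embarked have gaps.
print(train.isnull().sum())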
#Split train based on missing Age values: rows with Age missing become the
#set we will predict; rows with Age present become our training data.
#Using .copy() avoids pandas' SettingWithCopyWarning when we transform later.
atest=train.loc[train['Age'].isnull(),:].copy()
atrain_temp=train.loc[train['Age'].notnull(),:].copy()
atrain_temp
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

714 rows × 12 columns
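Before modeling, a quick sanity check (a suggested addition, not in the original notebook) confirms the two subsets partition the original 891 training rows:

#714 rows with Age plus 177 rows without Age should total 891.
print(atrain_temp.shape, atest.shape, train.shape)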

28.1. Preprocessing function

We want to create a preprocessing function that handles the transformation of both our train and test sets.

from sklearn.impute import SimpleImputer
import numpy as np

cat_features = ['Pclass', 'Sex', 'Embarked']
num_features = ['SibSp', 'Parch', 'Fare']

def preprocess(df, num_features, cat_features, dv):
    features = cat_features + num_features
    #The dependent variable may be absent (e.g., in a test set).
    if dv in df.columns:
        y = df[dv]
    else:
        y = None
    #Address missing values.
    print("Total missing values before processing:", df[features].isna().sum().sum())
    #Impute categoricals with the most frequent value and numerics with the mean.
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df[cat_features] = imp_mode.fit_transform(df[cat_features])
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    df[num_features] = imp_mean.fit_transform(df[num_features])
    print("Total missing values after processing:", df[features].isna().sum().sum())
    #One-hot encode the categoricals, dropping the first level of each.
    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)
    return y, X

atrain_y, atrain_X = preprocess(atrain_temp, num_features, cat_features, 'Age')
#atest_y, atest_X = preprocess(atest, num_features, cat_features, 'Age')
Total missing values before processing: 2
Total missing values after processing: 0
atrain_X
SibSp Parch Fare Pclass_2 Pclass_3 Sex_male Embarked_Q Embarked_S
0 1.0 0.0 7.2500 0 1 1 0 1
1 1.0 0.0 71.2833 0 0 0 0 0
2 0.0 0.0 7.9250 0 1 0 0 1
3 1.0 0.0 53.1000 0 0 0 0 1
4 0.0 0.0 8.0500 0 1 1 0 1
... ... ... ... ... ... ... ... ...
885 0.0 5.0 29.1250 0 1 0 1 0
886 0.0 0.0 13.0000 1 0 1 0 1
887 0.0 0.0 30.0000 0 0 0 0 1
889 0.0 0.0 30.0000 0 0 1 0 0
890 0.0 0.0 7.7500 0 1 1 1 0

714 rows × 8 columns
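Note how get_dummies with drop_first=True turns each categorical feature with k levels into k-1 indicator columns; the dropped level (Pclass 1, female, Embarked C) becomes the baseline, encoded as all zeros. A minimal illustration with a toy Series:

#Three embarkation ports become two indicator columns;
#a row of all zeros represents the dropped baseline level 'C'.
print(pd.get_dummies(pd.Series(['S', 'C', 'Q']), drop_first=True))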

#Split the Age-labeled data into train and validation sets, stratified
#on sex so both splits have a similar male/female mix.
from sklearn.model_selection import train_test_split
atrain_X, aval_X, atrain_y, aval_y = train_test_split(
    atrain_X, atrain_y, train_size=0.6, test_size=0.4,
    random_state=122, stratify=atrain_X['Sex_male'])
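Stratifying on Sex_male keeps the proportion of male passengers comparable across the two splits. A quick check (a suggested addition, not in the original notebook) can confirm this:

#The mean of a 0/1 indicator is the proportion of males in each split.
print(atrain_X['Sex_male'].mean(), aval_X['Sex_male'].mean())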
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV, Ridge, RidgeCV, ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
reg=LinearRegression()
reg.fit(atrain_X,atrain_y)

print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
print('R2 for Train:', reg.score(atrain_X, atrain_y))
print('R2 for Validation:', reg.score(aval_X, aval_y))
Coefficients: 
 [-4.29741890e+00 -2.67412991e-01 -9.49154983e-03 -9.87322021e+00
 -1.47939030e+01  2.19619426e+00  5.43658582e+00  3.84105821e+00]
Intercept: 
 37.88921436586399
R2 for Train: 0.26599806605544163
R2 for Validation: 0.21284235213769864
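R2 is only one view of model fit. Since the target is age in years, error metrics such as MAE and RMSE are also directly interpretable; this is a suggested extension, not part of the original notebook:

from sklearn import metrics
val_pred = reg.predict(aval_X)
#Mean absolute error, in years of age.
print('MAE:', metrics.mean_absolute_error(aval_y, val_pred))
#Root mean squared error, also in years.
print('RMSE:', np.sqrt(metrics.mean_squared_error(aval_y, val_pred)))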
from sklearn import metrics

def evaluate(name, dtype, y_true, y_pred, results=None):
  """
  This creates a Pandas series with the model name and R2 results.
  """
  #Avoid a mutable default argument, which would be shared across calls.
  if results is None:
    results = pd.Series(dtype=float)
  results['name'] = name
  results['r2-' + dtype] = metrics.r2_score(y_true, y_pred)
  return results


def model(name, regressor, train_X, train_y, val_X, val_y):
  """
  This will train and evaluate a regressor.
  """
  regressor.fit(train_X, train_y)
  #Evaluate on the training set, then append the validation results.
  r1 = evaluate(name, "train", train_y, regressor.predict(train_X))
  r1 = evaluate(name, "validation", val_y, regressor.predict(val_X), results=r1)
  return r1
final = pd.DataFrame()
allmodels = {"linear": LinearRegression(),
             "gradient": GradientBoostingRegressor()}

for key, value in allmodels.items():
  print("Modeling: ", key, "...")
  results = model(key, value, atrain_X, atrain_y, aval_X, aval_y)
  #DataFrame.append is deprecated; concatenate the one-row result instead.
  final = pd.concat([final, results.to_frame().T], ignore_index=True)
final
Modeling:  linear ...
Modeling:  gradient ...
name r2-train r2-validation
0 linear 0.265998 0.212842
1 gradient 0.573295 0.236151

28.2. Challenge

Notice that the gradient boosting model fits the training data (R2 ≈ 0.57) much better than the validation data (R2 ≈ 0.24), a sign of overfitting. Run different levels of regularization and see what works best.
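As a starting point, here is a minimal sketch (one possible approach, not a provided solution) that sweeps several regularization strengths for the Ridge and Lasso models already imported above:

#Larger alpha means stronger regularization (more coefficient shrinkage).
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
  for name, regr in [('ridge', Ridge(alpha=alpha)), ('lasso', Lasso(alpha=alpha))]:
    regr.fit(atrain_X, atrain_y)
    print(name, alpha,
          'train R2:', round(regr.score(atrain_X, atrain_y), 4),
          'validation R2:', round(regr.score(aval_X, aval_y), 4))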