28. Titanic Regression

Here we are going to build a regression model to predict the missing values of the Age variable.

import pandas as pd

#Load the Titanic train and test sets from the course repository.
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')

print(train.columns, test.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Here is a broad description of the keys and what they mean:

pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
survival        Survival
                (0 = No; 1 = Yes)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
boat            Lifeboat
body            Body Identification Number
home.dest       Home/Destination

In general, it looks like name, sex, cabin, embarked, boat, body, and home.dest may be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:

train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
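Beyond eyeballing the first rows, a quick dtype and missing-value check (a suggested addition, not part of the original notebook) makes the categorical/numerical split and the gaps in Age explicit:

#Object columns are categorical candidates; the rest are numeric.
print(train.dtypes)
#Count missing values per column; Age, Cabin, and Embarked have gaps.
print(train.isnull().sum())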
#Split train based on missing Age values: rows with Age missing become the
#set we will predict; rows with Age present become our training data.
#Using .copy() avoids pandas' SettingWithCopyWarning when we transform later.
atest=train.loc[train['Age'].isnull(),:].copy()
atrain_temp=train.loc[train['Age'].notnull(),:].copy()
atrain_temp
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

714 rows × 12 columns
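Before modeling, a quick sanity check (a suggested addition, not in the original notebook) confirms the two subsets partition the original 891 training rows:

#714 rows with Age plus 177 rows without Age should total 891.
print(atrain_temp.shape, atest.shape, train.shape)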

28.1. Preprocessing function

We want to create a preprocessing function that handles the transformation of both our train and test sets.

from sklearn.impute import SimpleImputer
import numpy as np

cat_features = ['Pclass', 'Sex', 'Embarked']
num_features = ['SibSp', 'Parch', 'Fare']

def preprocess(df, num_features, cat_features, dv):
    features = cat_features + num_features
    #The dependent variable may be absent (e.g., in a test set).
    if dv in df.columns:
        y = df[dv]
    else:
        y = None
    #Address missing values.
    print("Total missing values before processing:", df[features].isna().sum().sum())
    #Impute categoricals with the most frequent value and numerics with the mean.
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df[cat_features] = imp_mode.fit_transform(df[cat_features])
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    df[num_features] = imp_mean.fit_transform(df[num_features])
    print("Total missing values after processing:", df[features].isna().sum().sum())
    #One-hot encode the categoricals, dropping the first level of each.
    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)
    return y, X

atrain_y, atrain_X = preprocess(atrain_temp, num_features, cat_features, 'Age')
#atest_y, atest_X = preprocess(atest, num_features, cat_features, 'Age')
Total missing values before processing: 2
Total missing values after processing: 0
atrain_X
SibSp Parch Fare Pclass_2 Pclass_3 Sex_male Embarked_Q Embarked_S
0 1.0 0.0 7.2500 0 1 1 0 1
1 1.0 0.0 71.2833 0 0 0 0 0
2 0.0 0.0 7.9250 0 1 0 0 1
3 1.0 0.0 53.1000 0 0 0 0 1
4 0.0 0.0 8.0500 0 1 1 0 1
... ... ... ... ... ... ... ... ...
885 0.0 5.0 29.1250 0 1 0 1 0
886 0.0 0.0 13.0000 1 0 1 0 1
887 0.0 0.0 30.0000 0 0 0 0 1
889 0.0 0.0 30.0000 0 0 1 0 0
890 0.0 0.0 7.7500 0 1 1 1 0

714 rows × 8 columns
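Note how get_dummies with drop_first=True turns each categorical feature with k levels into k-1 indicator columns; the dropped level (Pclass 1, female, Embarked C) becomes the baseline, encoded as all zeros. A minimal illustration with a toy Series:

#Three embarkation ports become two indicator columns;
#a row of all zeros represents the dropped baseline level 'C'.
print(pd.get_dummies(pd.Series(['S', 'C', 'Q']), drop_first=True))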

#Split the Age-labeled data into train and validation sets, stratified
#on sex so both splits have a similar male/female mix.
from sklearn.model_selection import train_test_split
atrain_X, aval_X, atrain_y, aval_y = train_test_split(
    atrain_X, atrain_y, train_size=0.6, test_size=0.4,
    random_state=122, stratify=atrain_X['Sex_male'])
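Stratifying on Sex_male keeps the proportion of male passengers comparable across the two splits. A quick check (a suggested addition, not in the original notebook) can confirm this:

#The mean of a 0/1 indicator is the proportion of males in each split.
print(atrain_X['Sex_male'].mean(), aval_X['Sex_male'].mean())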
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV, Ridge, RidgeCV, ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
reg=LinearRegression()
reg.fit(atrain_X,atrain_y)

print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
print('R2 for Train:', reg.score(atrain_X, atrain_y))
print('R2 for Validation:', reg.score(aval_X, aval_y))
Coefficients: 
 [-4.29741890e+00 -2.67412991e-01 -9.49154983e-03 -9.87322021e+00
 -1.47939030e+01  2.19619426e+00  5.43658582e+00  3.84105821e+00]
Intercept: 
 37.88921436586399
R2 for Train: 0.26599806605544163
R2 for Validation: 0.21284235213769864
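R2 is only one view of model fit. Since the target is age in years, error metrics such as MAE and RMSE are also directly interpretable; this is a suggested extension, not part of the original notebook:

from sklearn import metrics
val_pred = reg.predict(aval_X)
#Mean absolute error, in years of age.
print('MAE:', metrics.mean_absolute_error(aval_y, val_pred))
#Root mean squared error, also in years.
print('RMSE:', np.sqrt(metrics.mean_squared_error(aval_y, val_pred)))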
from sklearn import metrics

def evaluate(name, dtype, y_true, y_pred, results=None):
  """
  This creates a Pandas series with the model name and R2 results.
  """
  #Avoid a mutable default argument, which would be shared across calls.
  if results is None:
    results = pd.Series(dtype=float)
  results['name'] = name
  results['r2-' + dtype] = metrics.r2_score(y_true, y_pred)
  return results


def model(name, regressor, train_X, train_y, val_X, val_y):
  """
  This will train and evaluate a regressor.
  """
  regressor.fit(train_X, train_y)
  #Evaluate on the training set, then append the validation results.
  r1 = evaluate(name, "train", train_y, regressor.predict(train_X))
  r1 = evaluate(name, "validation", val_y, regressor.predict(val_X), results=r1)
  return r1
final = pd.DataFrame()
allmodels = {"linear": LinearRegression(),
             "gradient": GradientBoostingRegressor()}

for key, value in allmodels.items():
  print("Modeling: ", key, "...")
  results = model(key, value, atrain_X, atrain_y, aval_X, aval_y)
  #DataFrame.append is deprecated; concatenate the one-row result instead.
  final = pd.concat([final, results.to_frame().T], ignore_index=True)
final
Modeling:  linear ...
Modeling:  gradient ...
name r2-train r2-validation
0 linear 0.265998 0.212842
1 gradient 0.573295 0.236151

28.2. Challenge

Notice that the gradient boosting model fits the training data (R2 ≈ 0.57) much better than the validation data (R2 ≈ 0.24), a sign of overfitting. Run different levels of regularization and see what works best.
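As a starting point, here is a minimal sketch (one possible approach, not a provided solution) that sweeps several regularization strengths for the Ridge and Lasso models already imported above:

#Larger alpha means stronger regularization (more coefficient shrinkage).
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
  for name, regr in [('ridge', Ridge(alpha=alpha)), ('lasso', Lasso(alpha=alpha))]:
    regr.fit(atrain_X, atrain_y)
    print(name, alpha,
          'train R2:', round(regr.score(atrain_X, atrain_y), 4),
          'validation R2:', round(regr.score(aval_X, aval_y), 4))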