Titanic Regression
introml.analyticsdojo.com
28. Titanic Regression
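Here we are going to create a regression model for the Age variable. Rows where Age is known are used for training, and rows where Age is missing are set aside as the cases to predict.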
import os
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')
print(train.columns, test.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Here is a broad description of the keys and what they mean:
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: Survival (0 = No; 1 = Yes)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat
body: Body Identification Number
home.dest: Home/Destination
In general, it looks like name, sex, cabin, embarked, boat, body, and home.dest may be candidates for categorical features, while the rest appear to be numerical features. Note that boat, body, and home.dest belong to this general description but are not among the columns of the train and test files printed above. We can also look at the first few rows in the dataset to get a better understanding:
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
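The split into categorical and numerical candidates discussed above can also be checked from how pandas stores each column. The following is a minimal sketch on a hand-made frame; the `toy` frame and its values are illustrative assumptions, not rows loaded from the CSV.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (illustrative values).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, None],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", "S", "S"],
})

# Columns that are not stored as numbers are usually categorical candidates.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
# Columns stored as numbers are numerical candidates, although a coded
# column such as Pclass may still be better treated as categorical.
number_cols = toy.select_dtypes(include="number").columns.tolist()

print("non-numeric columns:", text_cols)
print("numeric columns:", number_cols)
# The number of distinct values per column is another useful hint.
print(toy.nunique())
```

A numeric dtype alone does not settle the question: Pclass is stored as an integer but has only three levels, which is why it can reasonably be handled as a categorical feature.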
#Set our train and test based on the missing values.
atest=train.loc[train['Age'].isnull(),:]
atrain_temp=train.loc[train['Age'].notnull(),:]
atrain_temp
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
714 rows × 12 columns
28.1. Preprocessing function
We want to create a preprocessing function that applies the same transformations (imputing missing values and one-hot encoding the categorical features) to both our train and test sets.
from sklearn.impute import SimpleImputer
import numpy as np
cat_features = ['Pclass', 'Sex', 'Embarked']
num_features = [ 'SibSp', 'Parch', 'Fare' ]
def preprocess(df, num_features, cat_features, dv):
    features = cat_features + num_features
    # The dependent variable may be absent, in which case y is None.
    if dv in df.columns:
        y = df[dv]
    else:
        y = None
    # Address missing values
    print("Total missing values before processing:", df[features].isna().sum().sum())
    # Categorical features: fill gaps with the most frequent value.
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df[cat_features] = imp_mode.fit_transform(df[cat_features])
    # Numerical features: fill gaps with the column mean.
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    df[num_features] = imp_mean.fit_transform(df[num_features])
    print("Total missing values after processing:", df[features].isna().sum().sum())
    # One-hot encode the categorical features, dropping the first level of each.
    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)
    return y, X
atrain_y, atrain_X = preprocess(atrain_temp, num_features, cat_features, 'Age')
#test_y, test_X = preprocess(atest, num_features, cat_features, 'Survived')
Total missing values before processing: 2
Total missing values after processing: 0
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py:3678: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[col] = igetitem(value, i)
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py:3678: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[col] = igetitem(value, i)
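The SettingWithCopyWarning above is raised because atrain_temp was created by selecting rows from train, and preprocess then assigns new values into it. One common way to avoid the warning is to take an explicit copy when making the selection. The sketch below shows this on a hypothetical toy frame; the names `toy` and `known_age` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: one row has no Age, another has no Fare.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 8.0, None, 53.1],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments cannot be mistaken for writes into a slice of toy.
known_age = toy.loc[toy["Age"].notnull(), :].copy()

# Assigning into the copy changes only the copy.
known_age["Fare"] = known_age["Fare"].fillna(known_age["Fare"].mean())

print(known_age)
print(toy)
```

The original frame keeps its missing Fare value, which makes it explicit that the imputation applies only to the rows selected for training.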
atrain_X
| | SibSp | Parch | Fare | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 7.2500 | 0 | 1 | 1 | 0 | 1 |
1 | 1.0 | 0.0 | 71.2833 | 0 | 0 | 0 | 0 | 0 |
2 | 0.0 | 0.0 | 7.9250 | 0 | 1 | 0 | 0 | 1 |
3 | 1.0 | 0.0 | 53.1000 | 0 | 0 | 0 | 0 | 1 |
4 | 0.0 | 0.0 | 8.0500 | 0 | 1 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
885 | 0.0 | 5.0 | 29.1250 | 0 | 1 | 0 | 1 | 0 |
886 | 0.0 | 0.0 | 13.0000 | 1 | 0 | 1 | 0 | 1 |
887 | 0.0 | 0.0 | 30.0000 | 0 | 0 | 0 | 0 | 1 |
889 | 0.0 | 0.0 | 30.0000 | 0 | 0 | 1 | 0 | 0 |
890 | 0.0 | 0.0 | 7.7500 | 0 | 1 | 1 | 1 | 0 |
714 rows × 8 columns
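The preprocess function above fits its imputers and builds its dummy columns on whichever frame it receives. When a second frame has to be transformed in the same way, one option is to fit the imputer on the training frame only and to fix the category levels before encoding, so that both frames end up with identical columns. This is a sketch under assumptions, not the notebook's own method: the frames `toy_train` and `toy_test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for a training frame and a frame to predict on.
toy_train = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", "Q", "S"],
})
toy_test = pd.DataFrame({
    "Fare": [np.nan, 8.05],
    "Embarked": ["S", "S"],
})

# Fit the imputer on the training data, then reuse it on the second frame.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_train["Fare"] = imp_mean.fit_transform(toy_train[["Fare"]]).ravel()
toy_test["Fare"] = imp_mean.transform(toy_test[["Fare"]]).ravel()

# Declare the same category levels in both frames so that get_dummies
# creates, and drops, the same columns even when a level is absent.
levels = sorted(toy_train["Embarked"].unique())
for frame in (toy_train, toy_test):
    frame["Embarked"] = pd.Categorical(frame["Embarked"], categories=levels)

train_X = pd.get_dummies(toy_train, columns=["Embarked"], drop_first=True)
test_X = pd.get_dummies(toy_test, columns=["Embarked"], drop_first=True)

print(train_X.columns.tolist())
print(test_X)
```

Without the shared category levels, a frame that contains only one port would lose that port's column to drop_first, and the two design matrices would no longer line up.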
#Import Module
from sklearn.model_selection import train_test_split
atrain_X, aval_X, atrain_y, aval_y = train_test_split(atrain_X, atrain_y, train_size=0.6, test_size=0.4, random_state=122, stratify = atrain_X['Sex_male'])
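The stratify argument in the split above asks train_test_split to keep the proportion of the given column (here Sex_male) approximately equal in the training and validation parts. A small sketch on synthetic data can confirm the effect; the frame `toy_X`, the target `toy_y`, and the 60/40 class balance are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 rows with Sex_male = 1 and 40 rows with Sex_male = 0.
toy_X = pd.DataFrame({
    "Sex_male": [1] * 60 + [0] * 40,
    "Fare": list(range(100)),
})
toy_y = pd.Series(range(100), name="Age")

tr_X, val_X, tr_y, val_y = train_test_split(
    toy_X, toy_y,
    train_size=0.6, test_size=0.4,
    random_state=122,
    stratify=toy_X["Sex_male"],
)

# Both parts should show roughly the same share of Sex_male = 1.
print("share in train:", tr_X["Sex_male"].mean())
print("share in validation:", val_X["Sex_male"].mean())
```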
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV, Ridge, RidgeCV, ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
reg=LinearRegression()
reg.fit(atrain_X,atrain_y)
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
print('R2 for train:', reg.score(atrain_X, atrain_y))
print('R2 for validation:', reg.score(aval_X, aval_y))
Coefficients:
[-4.29741890e+00 -2.67412991e-01 -9.49154983e-03 -9.87322021e+00
-1.47939030e+01 2.19619426e+00 5.43658582e+00 3.84105821e+00]
Intercept:
37.88921436586399
R2 for train: 0.26599806605544163
R2 for validation: 0.21284235213769864
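The score method of LinearRegression returns the coefficient of determination, R2, which compares the model's squared error with that of always predicting the mean. As a minimal sketch, the value can be computed by hand and checked against sklearn.metrics.r2_score; the arrays `y_true` and `y_pred` are made-up numbers, not output from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true ages and predictions (illustrative values only).
y_true = np.array([22.0, 38.0, 26.0, 35.0])
y_pred = np.array([25.0, 33.0, 28.0, 34.0])

# Residual sum of squares: error left over after using the model.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error from always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print("manual R2:", r2_manual)
print("sklearn R2:", r2_score(y_true, y_pred))
```

An R2 of 0 corresponds to doing no better than the mean, and values close to 1 mean that most of the variance in the target is explained.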
from sklearn import metrics
def evaluate(name, dtype, y_true, y_pred, results=None):
    """
    This creates a pandas Series with the model name and its R2 score.
    Pass an existing Series as results to add another score to it.
    """
    # Start a fresh Series on each call; a Series used as a default argument would be shared between calls.
    if results is None:
        results = pd.Series(dtype=object)
    results['name'] = name
    results['r2-' + dtype] = metrics.r2_score(y_true, y_pred)
    return results
def model(name, regressor, train_X, train_y, val_X, val_y):
    """
    This will train and evaluate a regressor.
    """
    regressor.fit(train_X, train_y)
    # Score the predictions on the training and validation sets.
    r1 = evaluate(name, "train", train_y, regressor.predict(train_X))
    r1 = evaluate(name, "validation", val_y, regressor.predict(val_X), results=r1)
    return r1
# Each key appears once; a repeated dictionary key would silently overwrite the earlier entry.
allmodels = {"linear": LinearRegression(),
             "gradient": GradientBoostingRegressor()}
rows = []
for key, value in allmodels.items():
    print("Modeling: ", key, "...")
    #atrain_X, aval_X, atrain_y, aval_y
    rows.append(model(key, value, atrain_X, atrain_y, aval_X, aval_y))
# Build the results table in one step (DataFrame.append was removed in pandas 2.0).
final = pd.DataFrame(rows)
#final_order=['name','accuracy-train', 'accuracy-validation', 'auc-train', 'auc-validation','recall-train', 'recall-validation']
#final=final.loc[:,final_order]
final
Modeling: linear ...
Modeling: gradient ...
| | name | r2-train | r2-validation |
---|---|---|---|
0 | linear | 0.265998 | 0.212842 |
1 | gradient | 0.573295 | 0.236151 |
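The scores in the table above come from a single hold-out split, so they depend on which rows happened to land in the validation part. K-fold cross-validation is a different procedure that repeats the fit over several splits and averages the result. The following is only a sketch of that idea on synthetic data generated with make_regression; the data, the choice of five folds, and the variable names are assumptions rather than part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the Titanic features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five-fold cross-validation, scoring each fold with R2.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("R2 per fold:", scores)
print("mean R2:", scores.mean())
```

The spread across folds gives a rough sense of how much a single validation score, such as those in the table, might vary from split to split.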
28.2. Challenge
Run models with different levels of regularization (for example, Ridge or Lasso with several alpha values) and see which gives the best validation R2.
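As a possible starting point, the sketch below sweeps the alpha parameter of Ridge and records the validation R2 for each value. It runs on synthetic data from make_regression rather than on the Titanic features, and the list of alphas, the pipeline with StandardScaler, and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for atrain_X and atrain_y.
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=122)

# Hypothetical grid of regularization strengths (larger alpha = stronger penalty).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {}
for alpha in alphas:
    # Standardize first, because the penalty depends on coefficient scale.
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    ridge.fit(X_tr, y_tr)
    scores[alpha] = ridge.score(X_val, y_val)  # validation R2

best_alpha = max(scores, key=scores.get)
for alpha, r2 in scores.items():
    print("alpha:", alpha, "validation R2:", r2)
print("best alpha:", best_alpha)
```

The same loop could be pointed at the atrain_X, atrain_y, aval_X, and aval_y frames from this section, or at Lasso instead of Ridge, to compare how each level of regularization affects the validation score.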