Titanic Classification - Challenge Solutions
25. Titanic Classification - Challenge Solution
As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.
import os
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')
print(train.columns, test.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Here is a broad description of the keys and what they mean:
| Key | Description | Values |
|---|---|---|
| pclass | Passenger Class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| survival | Survival | 0 = No; 1 = Yes |
| name | Name | |
| sex | Sex | |
| age | Age | |
| sibsp | Number of Siblings/Spouses Aboard | |
| parch | Number of Parents/Children Aboard | |
| ticket | Ticket Number | |
| fare | Passenger Fare | |
| cabin | Cabin | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |
| boat | Lifeboat | |
| body | Body Identification Number | |
| home.dest | Home/Destination | |
In general, it looks like name, sex, cabin, embarked, boat, body, and home.dest may be candidates for categorical features, while the rest appear to be numerical features. Note that boat, body, and home.dest do not appear among the columns printed above, so they are not available in the files loaded here. We can also look at the first couple of rows in the dataset to get a better understanding:
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
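One way to check which columns are candidates for categorical features is to look at each column's dtype and its number of distinct values. The sketch below does this on a small hand-made frame built from the values in the preview above; the names `sample`, `text_cols`, and `summary` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A hand-made sample using values from the preview above (illustrative only).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Non-numeric columns are natural candidates for categorical features.
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric codes with few distinct values (such as Pclass) can also be
# treated as categorical, even though they are stored as numbers.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
})

print(text_cols)
print(summary)
```

This is why `Pclass` ends up in `cat_features` in the next section even though it is stored as an integer.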
25.1. Preprocessing function
We want to create a preprocessing function that can address transformation of our train and test set.
from sklearn.impute import SimpleImputer
import numpy as np
cat_features = ['Pclass', 'Sex', 'Embarked']
num_features = [ 'Age', 'SibSp', 'Parch', 'Fare' ]
def preprocess(df, num_features, cat_features, dv):
    features = cat_features + num_features
    if dv in df.columns:
        y = df[dv]
    else:
        y = None
    #Address missing variables
    print("Total missing values before processing:", df[features].isna().sum().sum() )
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df[cat_features] = imp_mode.fit_transform(df[cat_features] )
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    df[num_features] = imp_mean.fit_transform(df[num_features])
    print("Total missing values after processing:", df[features].isna().sum().sum() )
    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)
    return y, X
y, X = preprocess(train, num_features, cat_features, 'Survived')
test_y, test_X = preprocess(test, num_features, cat_features, 'Survived')
Total missing values before processing: 179
Total missing values after processing: 0
Total missing values before processing: 87
Total missing values after processing: 0
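The function above refits the imputers on whichever frame it receives, so the test set is filled with its own means and modes. A common alternative is to learn the fill values and the category levels from the training data and reuse them on the test data. The following is a minimal sketch of that idea, not the notebook's method; the frames `toy_train` and `toy_test`, their values, and the helper `fill` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values) with one numeric and one categorical column.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Embarked": ["S", "S", "C", np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Embarked": [np.nan, "S"]})

# Learn the fill values and the category levels from the training data only.
imp_mean = SimpleImputer(strategy="mean").fit(toy_train[["Age"]])
imp_mode = SimpleImputer(strategy="most_frequent").fit(toy_train[["Embarked"]])
levels = sorted(toy_train["Embarked"].dropna().unique())

def fill(df):
    """Return a copy of df with gaps filled using the training statistics."""
    out = df.copy()
    out["Age"] = imp_mean.transform(df[["Age"]]).ravel()
    filled = imp_mode.transform(df[["Embarked"]]).ravel()
    # A fixed category list makes get_dummies emit the same columns for every frame.
    out["Embarked"] = pd.Categorical(filled, categories=levels)
    return out

train_filled = fill(toy_train)
test_filled = fill(toy_test)

train_X = pd.get_dummies(train_filled, columns=["Embarked"], drop_first=True)
test_X = pd.get_dummies(test_filled, columns=["Embarked"], drop_first=True)

print(test_filled)
print(list(train_X.columns), list(test_X.columns))
```

Fixing the category levels matters because `drop_first=True` drops whichever level sorts first in the frame it sees; if a level were missing from the test set, the train and test columns could otherwise stop lining up.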
25.2. Train Test Split
Now we are ready to model. We are going to separate the Kaggle-provided training data into a “Train” and a “Validation” set.
#Import Module
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=122,stratify=y)
print(train_y.mean(), val_y.mean())
0.38362760834670945 0.3843283582089552
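The two printed means are nearly identical because `stratify=y` keeps the share of survivors about the same in both parts. The sketch below illustrates the effect on made-up labels rather than the Titanic data; `toy_X`, `toy_y`, and the `strat_*` names are assumptions introduced for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 40 positives and 60 negatives (not the Titanic data).
toy_y = np.array([1] * 40 + [0] * 60)
toy_X = np.arange(100).reshape(-1, 1)

# With stratify, each part keeps roughly the 40% positive rate of the full data.
_, _, strat_train_y, strat_val_y = train_test_split(
    toy_X, toy_y, train_size=0.7, test_size=0.3,
    random_state=122, stratify=toy_y)

print(strat_train_y.mean(), strat_val_y.mean())
```

Without `stratify`, the two rates would depend on the random draw and could drift apart, which makes train and validation metrics harder to compare.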
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import metrics
from sklearn import tree
classifier = tree.DecisionTreeClassifier(max_depth=4)
#This fits the model object to the data.
classifier.fit(train_X, train_y)
#This creates the prediction.
train_y_pred = classifier.predict(train_X)
val_y_pred = classifier.predict(val_X)
test['Survived'] = classifier.predict(test_X)
print("Accuracy train: ", metrics.accuracy_score(train_y, train_y_pred) )
print("Accuracy validation: ", metrics.accuracy_score(val_y, val_y_pred) )
Accuracy train: 0.8202247191011236
Accuracy validation: 0.8432835820895522
print("Recall train: ", metrics.recall_score(train_y, train_y_pred) )
print("Recall validation: ", metrics.recall_score(val_y, val_y_pred) )
Recall train: 0.698744769874477
Recall validation: 0.7572815533980582
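Accuracy and recall answer different questions: accuracy is the share of all passengers classified correctly, while recall is the share of actual survivors that the model identified. As a rough illustration, the sketch below recomputes both from a confusion matrix on made-up labels (not the model's predictions) and compares them with the scikit-learn functions.

```python
from sklearn import metrics

# Made-up labels and predictions (1 = survived, 0 = did not survive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all cases
recall = tp / (tp + fn)                     # survivors found / all survivors

print(accuracy, metrics.accuracy_score(y_true, y_pred))
print(recall, metrics.recall_score(y_true, y_pred))
```

In this toy case accuracy is 0.7 while recall is only 0.5, which shows how a model can look reasonable on accuracy while missing half of the positive class.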
25.3. Outputting Probabilities
Some evaluation metrics, such as the Area Under the Receiver Operating Characteristic Curve (ROC AUC), take the predicted probability rather than the predicted class as input.
The function predict_proba outputs the probability of each class. Here, we want only the second value, which is the probability of survival.
When working with a new evaluation metric, always check whether it expects probabilities or class labels.
train_y_pred_prob = classifier.predict_proba(train_X)[:,1]
val_y_pred_prob = classifier.predict_proba(val_X)[:,1]
test_y_pred_prob = classifier.predict_proba(test_X)[:,1]
print("ROC AUC train: ", metrics.roc_auc_score(train_y, train_y_pred_prob) )
print("ROC AUC validation: ", metrics.roc_auc_score(val_y, val_y_pred_prob) )
ROC AUC train: 0.8719763336820084
ROC AUC validation: 0.8686672550750221
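Taking column 1 of `predict_proba` works here because the columns follow the order of the fitted model's `classes_` attribute, and for 0/1 labels the class 1 comes second. The sketch below looks up the positive-class column explicitly instead of hard-coding it; the toy data and the names `clf`, `pos_col`, and `pos_prob` are assumptions made for the example.

```python
import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data (not the Titanic data).
toy_X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
toy_y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(toy_X, toy_y)

# Columns of predict_proba follow the order of clf.classes_.
proba = clf.predict_proba(toy_X)
pos_col = list(clf.classes_).index(1)
pos_prob = proba[:, pos_col]

# roc_auc_score expects scores or probabilities for the positive class.
auc_from_prob = metrics.roc_auc_score(toy_y, pos_prob)

print(clf.classes_, pos_col)
print(auc_from_prob)
```

Looking up the column this way keeps the code correct if the labels are ever encoded differently, for example as strings.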
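#Write only the two submission columns; index=False keeps the row index out of the file.
test[['PassengerId','Survived']].to_csv('submission.csv', index=False)
#Download the file (this step only works when running in Google Colab).
from google.colab import files
files.download('submission.csv')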
25.4. Challenge
Create a function that can accept any scikit-learn model and assess its performance on the validation set, storing the results as a dataframe.
#Function Definition
def evaluate(name, dtype, y_true, y_pred, y_prob, results=None):
    """
    This creates a Pandas series with different results.
    """
    #Create a new series on each call; a pd.Series() default argument would be shared across calls.
    if results is None:
        results = pd.Series(dtype=object)
    results['name']=name
    results['accuracy-'+dtype]=metrics.accuracy_score(y_true, y_pred)
    results['recall-'+dtype]=metrics.recall_score(y_true, y_pred)
    results['auc-'+dtype]=metrics.roc_auc_score(y_true, y_prob)
    return results
def model(name, classifier, train_X, train_y, val_X, val_y):
    """
    This will train and evaluate a classifier.
    """
    classifier.fit(train_X, train_y)
    #This evaluates the predictions on the train set, then adds the validation results to the same series.
    r1= evaluate(name, "train", train_y, classifier.predict(train_X), classifier.predict_proba(train_X)[:,1])
    r1= evaluate(name,"validation", val_y, classifier.predict(val_X), classifier.predict_proba(val_X)[:,1], results=r1)
    return r1
25.5. Analyze Multiple Models
This code fits and evaluates every model in the dictionary.
allmodels={"knearest": KNeighborsClassifier(n_neighbors=10),
           "adaboost":AdaBoostClassifier()}
rows=[]
for key, value in allmodels.items():
    print("Modeling: ", key, "...")
    results= model(key, value, train_X, train_y, val_X, val_y)
    #DataFrame.append was removed in pandas 2.0, so collect each result as a dict.
    rows.append(results.to_dict())
final=pd.DataFrame(rows)
final_order=['name','accuracy-train', 'accuracy-validation', 'auc-train', 'auc-validation','recall-train', 'recall-validation']
final=final.loc[:,final_order]
final
Modeling: knearest ...
Modeling: adaboost ...
| | name | accuracy-train | accuracy-validation | auc-train | auc-validation | recall-train | recall-validation |
|---|---|---|---|---|---|---|---|
| 0 | knearest | 0.744783 | 0.712687 | 0.809564 | 0.781642 | 0.506276 | 0.436893 |
| 1 | adaboost | 0.821830 | 0.817164 | 0.896977 | 0.880229 | 0.744770 | 0.766990 |
25.5.1. Challenge
Augment the modeling to include Random Forests at multiple different hyperparameter levels.
Augment the evaluation to include Balanced Accuracy and F1 score.
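One possible way to approach this challenge is sketched below. It is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` so that it is self-contained, the hyperparameter levels are chosen arbitrarily, and the names `evaluate_extended`, `demo_X`, `demo_y`, and `rf_results` are hypothetical rather than part of the notebook. With the Titanic data, the train and validation frames from the sections above would take the place of the synthetic split.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own (not the Titanic data).
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=122, stratify=demo_y)

def evaluate_extended(name, split, y_true, y_pred, y_prob):
    """Return a dict of metrics for one model on one data split."""
    return {
        "name": name,
        "accuracy-" + split: metrics.accuracy_score(y_true, y_pred),
        "balanced-accuracy-" + split: metrics.balanced_accuracy_score(y_true, y_pred),
        "f1-" + split: metrics.f1_score(y_true, y_pred),
        "recall-" + split: metrics.recall_score(y_true, y_pred),
        "auc-" + split: metrics.roc_auc_score(y_true, y_prob),
    }

rows = []
# Hyperparameter levels chosen arbitrarily for illustration.
for n_estimators in (50, 200):
    for max_depth in (2, 4, None):
        name = f"rf-n{n_estimators}-d{max_depth}"
        rf = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0)
        rf.fit(tr_X, tr_y)
        row = {}
        # Evaluate the fitted forest on both splits and merge into one row.
        for split, X_part, y_part in (("train", tr_X, tr_y),
                                      ("validation", va_X, va_y)):
            row.update(evaluate_extended(
                name, split, y_part,
                rf.predict(X_part), rf.predict_proba(X_part)[:, 1]))
        rows.append(row)

rf_results = pd.DataFrame(rows)
print(rf_results)
```

Returning a plain dict per split and building the dataframe once at the end avoids both a shared default argument and repeated dataframe appends.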