{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "xRlGzpOI8eO0" }, "source": [ "\n", "[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)\n", "

<center><h1>Titanic Classification</h1></center>\n", "<center><h3>introml.analyticsdojo.com</h3></center>

\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Titanic Classification" ] }, { "cell_type": "markdown", "metadata": { "id": "7pW1UhJT8ePk" }, "source": [ "As an example of how to work with both categorical and numerical data, we will predict survival for the passengers of the RMS Titanic.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bvj3Wids8ePm", "outputId": "4ca83181-968f-4ba5-e8cf-6129f88f554b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n", " 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n", " dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',\n", " 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n", " dtype='object')\n" ] } ], "source": [ "import os\n", "import pandas as pd\n", "train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')\n", "test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')\n", "\n", "print(train.columns, test.columns)" ] }, { "cell_type": "markdown", "metadata": { "id": "0xqjk2-P8ePp" }, "source": [ "Here is a broad description of the keys and what they mean:\n", "\n", "```\n", 
"pclass          Passenger Class\n", 
"                (1 = 1st; 2 = 2nd; 3 = 3rd)\n", 
"survival        Survival\n", 
"                (0 = No; 1 = Yes)\n", 
"name            Name\n", 
"sex             Sex\n", 
"age             Age\n", 
"sibsp           Number of Siblings/Spouses Aboard\n", 
"parch           Number of Parents/Children Aboard\n", 
"ticket          Ticket Number\n", 
"fare            Passenger Fare\n", 
"cabin           Cabin\n", 
"embarked        Port of Embarkation\n", 
"                (C = Cherbourg; Q = Queenstown; S = Southampton)\n", 
"boat            Lifeboat\n", 
"body            Body Identification Number\n", 
"home.dest       Home/Destination\n", 
"```\n", "\n", "In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` may 
be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 280 }, "id": "bqmMR9G78ePr", "outputId": "b1ca97e9-9196-4790-9d1c-fc98d74d30d1" }, "outputs": [ { "data": { "text/html": [ "
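<table>\n", 
"<thead>\n", 
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n", 
"</thead>\n", 
"<tbody>\n", 
"<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n", 
"<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n", 
"<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n", 
"<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n", 
"<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n", 
"</tbody>\n", 
"</table>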
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass ... Fare Cabin Embarked\n", "0 1 0 3 ... 7.2500 NaN S\n", "1 2 1 1 ... 71.2833 C85 C\n", "2 3 1 3 ... 7.9250 NaN S\n", "3 4 1 1 ... 53.1000 C123 S\n", "4 5 0 3 ... 8.0500 NaN S\n", "\n", "[5 rows x 12 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "54WY6zD78ePv" }, "source": [ "### Preprocessing function\n", "\n", "We want a preprocessing function that applies the same transformations to both the training and test sets." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6hL63MtX8ePz", "outputId": "ed524183-5dca-4643-91aa-69329f86e1ad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total missing values before processing: 179\n", "Total missing values after processing: 0\n", "Total missing values before processing: 87\n", "Total missing values after processing: 0\n" ] } ], "source": [ "from sklearn.impute import SimpleImputer\n", "import numpy as np\n", "\n", 
"cat_features = ['Pclass', 'Sex', 'Embarked']\n", 
"num_features = ['Age', 'SibSp', 'Parch', 'Fare']\n", 
"def preprocess(df, num_features, cat_features, dv):\n", 
"    features = cat_features + num_features\n", 
"    # The dependent variable is only present in the training set.\n", 
"    if dv in df.columns:\n", 
"        y = df[dv]\n", 
"    else:\n", 
"        y = None\n", 
"    # Address missing values\n", 
"    print(\"Total missing values before processing:\", df[features].isna().sum().sum())\n", 
"\n", 
"    # Fill categorical columns with the mode and numerical columns with the mean.\n", 
"    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')\n", 
"    df[cat_features] = imp_mode.fit_transform(df[cat_features])\n", 
"    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')\n", 
"    df[num_features] = imp_mean.fit_transform(df[num_features])\n", 
"    print(\"Total missing values after processing:\", df[features].isna().sum().sum())\n", 
"\n", 
"    # One-hot encode the categorical columns, dropping the first level of each.\n", 
"    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)\n", 
"    return y, X\n", 
"\n", 
"y, X = preprocess(train, num_features, cat_features, 'Survived')\n", 
"test_y, test_X = preprocess(test, num_features, cat_features, 'Survived')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 447 }, "id": "ssoaorx6qyse", "outputId": "01f8dd08-4071-44b2-c707-6f2edf087cdc" }, "outputs": [ { "data": { "text/html": [ "
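<table>\n", 
"<thead>\n", 
"<tr><th></th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Pclass_2</th><th>Pclass_3</th><th>Sex_male</th><th>Embarked_Q</th><th>Embarked_S</th></tr>\n", 
"</thead>\n", 
"<tbody>\n", 
"<tr><th>0</th><td>22.000000</td><td>1.0</td><td>0.0</td><td>7.2500</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>1</th><td>38.000000</td><td>1.0</td><td>0.0</td><td>71.2833</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>2</th><td>26.000000</td><td>0.0</td><td>0.0</td><td>7.9250</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>3</th><td>35.000000</td><td>1.0</td><td>0.0</td><td>53.1000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>4</th><td>35.000000</td><td>0.0</td><td>0.0</td><td>8.0500</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", 
"<tr><th>886</th><td>27.000000</td><td>0.0</td><td>0.0</td><td>13.0000</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>887</th><td>19.000000</td><td>0.0</td><td>0.0</td><td>30.0000</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>888</th><td>29.699118</td><td>1.0</td><td>2.0</td><td>23.4500</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n", 
"<tr><th>889</th><td>26.000000</td><td>0.0</td><td>0.0</td><td>30.0000</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n", 
"<tr><th>890</th><td>32.000000</td><td>0.0</td><td>0.0</td><td>7.7500</td><td>0</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n", 
"</tbody>\n", 
"</table>\n", 
"<p>891 rows × 9 columns</p>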
" ], "text/plain": [ " Age SibSp Parch ... Sex_male Embarked_Q Embarked_S\n", "0 22.000000 1.0 0.0 ... 1 0 1\n", "1 38.000000 1.0 0.0 ... 0 0 0\n", "2 26.000000 0.0 0.0 ... 0 0 1\n", "3 35.000000 1.0 0.0 ... 0 0 1\n", "4 35.000000 0.0 0.0 ... 1 0 1\n", ".. ... ... ... ... ... ... ...\n", "886 27.000000 0.0 0.0 ... 1 0 1\n", "887 19.000000 0.0 0.0 ... 0 0 1\n", "888 29.699118 1.0 2.0 ... 0 0 1\n", "889 26.000000 0.0 0.0 ... 1 0 0\n", "890 32.000000 0.0 0.0 ... 1 1 0\n", "\n", "[891 rows x 9 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "icKFkwZQpvCs" }, "outputs": [], "source": [ "#Import Module\n", "from sklearn.model_selection import train_test_split\n", "train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=122, stratify=y)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "jGoUxc7brPIg" }, "outputs": [], "source": [ "from sklearn.neural_network import MLPClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.gaussian_process import GaussianProcessClassifier\n", "from sklearn.gaussian_process.kernels import RBF\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n", "from sklearn import metrics" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6kHwslmYrcRw", "outputId": "f2bcd29c-8da2-49a5-dcf5-fc5457dd7e0f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metrics score train: 0.7447833065810594\n", "Metrics score validation: 0.7126865671641791\n" ] } ], "source": [ "classifier = 
KNeighborsClassifier(n_neighbors=10)\n", 
"# Fit the model to the training data.\n", 
"classifier.fit(train_X, train_y)\n", 
"# Predict on the training, validation, and test sets.\n", 
"train_y_pred = classifier.predict(train_X)\n", 
"val_y_pred = classifier.predict(val_X)\n", 
"test_y_pred = classifier.predict(test_X)\n", 
"print(\"Metrics score train: \", metrics.accuracy_score(train_y, train_y_pred) )\n", 
"print(\"Metrics score validation: \", metrics.accuracy_score(val_y, val_y_pred) )" ] }, 
{ "cell_type": "code", "execution_count": 9, "metadata": { "id": "Fqcyco3ivotP" }, "outputs": [], "source": [ "test['Survived']=classifier.predict(test_X)" ] }, 
{ "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 60 }, "id": "VtjkfeO1wsw8", "outputId": "66e977cd-5e11-4e35-e239-3dd7796c8514" }, "outputs": [], "source": [ "test[['PassengerId','Survived']].to_csv('submission.csv', index=False)\n", "from google.colab import files\n", "files.download('submission.csv')" ] }, 
{ "cell_type": "markdown", "metadata": { "id": "5JrJQAqMwJY5" }, "source": [ "## Challenge\n", "Create a function that can accept any scikit-learn model and assess its performance on the validation set, storing the results in a DataFrame." ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "rXRikRZwvNMO" }, "outputs": [], "source": [] } ], 
"metadata": { "colab": { "collapsed_sections": [], "name": "Copy of 05_features_dummies.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 1 }