{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xRlGzpOI8eO0"
   },
   "source": [
    "\n",
    "[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)\n",
    "<center><h1>Titanic Classification</h1></center>\n",
    "<center><h3><a href = 'http://introml.analyticsdojo.com'>introml.analyticsdojo.com</a></h3></center>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Titanic Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7pW1UhJT8ePk"
   },
   "source": [
    "As an example of how to work with both categorical and numerical data, we will perform survival predicition for the passengers of the HMS Titanic.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bvj3Wids8ePm",
    "outputId": "4ca83181-968f-4ba5-e8cf-6129f88f554b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
      "       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
      "      dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',\n",
      "       'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')\n",
    "test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')\n",
    "\n",
    "print(train.columns, test.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0xqjk2-P8ePp"
   },
   "source": [
    "Here is a broad description of the keys and what they mean:\n",
    "\n",
    "```\n",
    "pclass          Passenger Class\n",
    "                (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
    "survival        Survival\n",
    "                (0 = No; 1 = Yes)\n",
    "name            Name\n",
    "sex             Sex\n",
    "age             Age\n",
    "sibsp           Number of Siblings/Spouses Aboard\n",
    "parch           Number of Parents/Children Aboard\n",
    "ticket          Ticket Number\n",
    "fare            Passenger Fare\n",
    "cabin           Cabin\n",
    "embarked        Port of Embarkation\n",
    "                (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
    "boat            Lifeboat\n",
    "body            Body Identification Number\n",
    "home.dest       Home/Destination\n",
    "```\n",
    "\n",
    "In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `homedest` may be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 280
    },
    "id": "bqmMR9G78ePr",
    "outputId": "b1ca97e9-9196-4790-9d1c-fc98d74d30d1"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked\n",
       "0            1         0       3  ...   7.2500   NaN         S\n",
       "1            2         1       1  ...  71.2833   C85         C\n",
       "2            3         1       3  ...   7.9250   NaN         S\n",
       "3            4         1       1  ...  53.1000  C123         S\n",
       "4            5         0       3  ...   8.0500   NaN         S\n",
       "\n",
       "[5 rows x 12 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "54WY6zD78ePv"
   },
   "source": [
    "### Preprocessing function\n",
    "\n",
    "We want to create a preprocessing function that can address transformation of our train and test set.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6hL63MtX8ePz",
    "outputId": "ed524183-5dca-4643-91aa-69329f86e1ad"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total missing values before processing: 179\n",
      "Total missing values after processing: 0\n",
      "Total missing values before processing: 87\n",
      "Total missing values after processing: 0\n"
     ]
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "import numpy as np\n",
    "\n",
    "cat_features = ['Pclass', 'Sex', 'Embarked']\n",
    "num_features =  [ 'Age', 'SibSp', 'Parch', 'Fare'  ]\n",
    "def preprocess(df, num_features, cat_features, dv):\n",
    "    features = cat_features + num_features\n",
    "    if dv in df.columns:\n",
    "      y = df[dv]\n",
    "    else:\n",
    "      y=None \n",
    "    #Address missing variables\n",
    "    print(\"Total missing values before processing:\", df[features].isna().sum().sum() )\n",
    "  \n",
    "    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')\n",
    "    df[cat_features]=imp_mode.fit_transform(df[cat_features] )\n",
    "    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
    "    df[num_features]=imp_mean.fit_transform(df[num_features])\n",
    "    print(\"Total missing values after processing:\", df[features].isna().sum().sum() )\n",
    "   \n",
    "    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)\n",
    "    return y,X\n",
    "\n",
    "y, X =  preprocess(train, num_features, cat_features, 'Survived')\n",
    "test_y, test_X = preprocess(test, num_features, cat_features, 'Survived')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 447
    },
    "id": "ssoaorx6qyse",
    "outputId": "01f8dd08-4071-44b2-c707-6f2edf087cdc"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Pclass_2</th>\n",
       "      <th>Pclass_3</th>\n",
       "      <th>Sex_male</th>\n",
       "      <th>Embarked_Q</th>\n",
       "      <th>Embarked_S</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>22.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>38.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>35.000000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>35.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>886</th>\n",
       "      <td>27.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>13.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>887</th>\n",
       "      <td>19.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>888</th>\n",
       "      <td>29.699118</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>23.4500</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>889</th>\n",
       "      <td>26.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>890</th>\n",
       "      <td>32.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>7.7500</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>891 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           Age  SibSp  Parch  ...  Sex_male  Embarked_Q  Embarked_S\n",
       "0    22.000000    1.0    0.0  ...         1           0           1\n",
       "1    38.000000    1.0    0.0  ...         0           0           0\n",
       "2    26.000000    0.0    0.0  ...         0           0           1\n",
       "3    35.000000    1.0    0.0  ...         0           0           1\n",
       "4    35.000000    0.0    0.0  ...         1           0           1\n",
       "..         ...    ...    ...  ...       ...         ...         ...\n",
       "886  27.000000    0.0    0.0  ...         1           0           1\n",
       "887  19.000000    0.0    0.0  ...         0           0           1\n",
       "888  29.699118    1.0    2.0  ...         0           0           1\n",
       "889  26.000000    0.0    0.0  ...         1           0           0\n",
       "890  32.000000    0.0    0.0  ...         1           1           0\n",
       "\n",
       "[891 rows x 9 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "icKFkwZQpvCs"
   },
   "outputs": [],
   "source": [
    "#Import Module\n",
    "from sklearn.model_selection import train_test_split\n",
    "train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=122, stratify=y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "jGoUxc7brPIg"
   },
   "outputs": [],
   "source": [
    "from sklearn.neural_network import MLPClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.gaussian_process import GaussianProcessClassifier\n",
    "from sklearn.gaussian_process.kernels import RBF\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6kHwslmYrcRw",
    "outputId": "f2bcd29c-8da2-49a5-dcf5-fc5457dd7e0f"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Metrics score train:  0.7447833065810594\n",
      "Metrics score validation:  0.7126865671641791\n"
     ]
    }
   ],
   "source": [
    "classifier = KNeighborsClassifier(n_neighbors=10)\n",
    "#This fits the model object to the data.\n",
    "classifier.fit(train_X, train_y)\n",
    "#This creates the prediction. \n",
    "train_y_pred = classifier.predict(train_X)\n",
    "val_y_pred = classifier.predict(val_X)\n",
    "test_y_pred = classifier.predict(test_X)\n",
    "print(\"Metrics score train: \", metrics.accuracy_score(train_y, train_y_pred) )\n",
    "print(\"Metrics score validation: \", metrics.accuracy_score(val_y, val_y_pred) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "Fqcyco3ivotP"
   },
   "outputs": [],
   "source": [
    "test['Survived']=classifier.predict(test_X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 60
    },
    "id": "VtjkfeO1wsw8",
    "outputId": "66e977cd-5e11-4e35-e239-3dd7796c8514"
   },
   "outputs": [
    {
     "data": {
      "application/javascript": [
       "\n",
       "    async function download(id, filename, size) {\n",
       "      if (!google.colab.kernel.accessAllowed) {\n",
       "        return;\n",
       "      }\n",
       "      const div = document.createElement('div');\n",
       "      const label = document.createElement('label');\n",
       "      label.textContent = `Downloading \"${filename}\": `;\n",
       "      div.appendChild(label);\n",
       "      const progress = document.createElement('progress');\n",
       "      progress.max = size;\n",
       "      div.appendChild(progress);\n",
       "      document.body.appendChild(div);\n",
       "\n",
       "      const buffers = [];\n",
       "      let downloaded = 0;\n",
       "\n",
       "      const channel = await google.colab.kernel.comms.open(id);\n",
       "      // Send a message to notify the kernel that we're ready.\n",
       "      channel.send({})\n",
       "\n",
       "      for await (const message of channel.messages) {\n",
       "        // Send a message to notify the kernel that we're ready.\n",
       "        channel.send({})\n",
       "        if (message.buffers) {\n",
       "          for (const buffer of message.buffers) {\n",
       "            buffers.push(buffer);\n",
       "            downloaded += buffer.byteLength;\n",
       "            progress.value = downloaded;\n",
       "          }\n",
       "        }\n",
       "      }\n",
       "      const blob = new Blob(buffers, {type: 'application/binary'});\n",
       "      const a = document.createElement('a');\n",
       "      a.href = window.URL.createObjectURL(blob);\n",
       "      a.download = filename;\n",
       "      div.appendChild(a);\n",
       "      a.click();\n",
       "      div.remove();\n",
       "    }\n",
       "  "
      ],
      "text/plain": [
       "<IPython.core.display.Javascript object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/javascript": [
       "download(\"download_73e99ba6-2df4-4c2b-8ffb-94f2a74619e4\", \"submission.csv\", 4402)"
      ],
      "text/plain": [
       "<IPython.core.display.Javascript object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "test[['PassengerId','Survived']].to_csv('submission.csv')\n",
    "from google.colab import files\n",
    "files.download('submission.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5JrJQAqMwJY5"
   },
   "source": [
    "## Challenge\n",
    "Create a function that can accept any Scikit learn model and assess the perfomance in the validation set, storing results as a dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rXRikRZwvNMO"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Copy of 05_features_dummies.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}