\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-XovA71E3XFM"
},
"source": [
"# Titanic PCA"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7pW1UhJT8ePk"
},
"source": [
"As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bvj3Wids8ePm",
"outputId": "3c075657-ff1a-424c-9757-3eb6ef1c2b18"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
" dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',\n",
" 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"train = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')\n",
"test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')\n",
"\n",
"print(train.columns, test.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0xqjk2-P8ePp"
},
"source": [
"Here is a broad description of the keys and what they mean:\n",
"\n",
"```\n",
"pclass Passenger Class\n",
" (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
"survival Survival\n",
" (0 = No; 1 = Yes)\n",
"name Name\n",
"sex Sex\n",
"age Age\n",
"sibsp Number of Siblings/Spouses Aboard\n",
"parch Number of Parents/Children Aboard\n",
"ticket Ticket Number\n",
"fare Passenger Fare\n",
"cabin Cabin\n",
"embarked Port of Embarkation\n",
" (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"boat Lifeboat\n",
"body Body Identification Number\n",
"home.dest Home/Destination\n",
"```\n",
"\n",
"In general, `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` look like candidates for categorical features, while the rest appear to be numerical features. Note that the CSV files loaded above do not include the `boat`, `body`, or `home.dest` columns. We can also look at the first few rows of the dataset to get a better understanding:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 548
},
"id": "bqmMR9G78ePr",
"outputId": "b8b2ab48-ac65-48a9-f7ff-4575b1d63f8d"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "54WY6zD78ePv"
},
"source": [
"### Preprocessing function\n",
"\n",
"We want a single preprocessing function that applies the same transformations (imputing missing values and one-hot encoding the categorical features) to both the train and test sets. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FKX26KU34Ti6",
"outputId": "a9bd50ee-68c8-4ffb-f077-77c700aecbed"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total missing values before processing: 179\n",
"Total missing values after processing: 0\n",
"Total missing values before processing: 87\n",
"Total missing values after processing: 0\n"
]
}
],
"source": [
"from sklearn.impute import SimpleImputer\n",
"import numpy as np\n",
"\n",
"cat_features = ['Pclass', 'Sex', 'Embarked']\n",
"num_features = [ 'Age', 'SibSp', 'Parch', 'Fare' ]\n",
"\n",
"\n",
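"def preprocess(df, num_features, cat_features, dv):\n",
"    \"\"\"Impute missing values, one-hot encode, and return (y, X). Imputes df in place.\"\"\"\n",
"    features = cat_features + num_features\n",
"    # The dependent variable is absent from the test set, so y may be None.\n",
"    if dv in df.columns:\n",
"        y = df[dv]\n",
"    else:\n",
"        y = None\n",
"    # Address missing values\n",
"    print(\"Total missing values before processing:\", df[features].isna().sum().sum() )\n",
"\n",
"    # Categorical features: fill with the most frequent value (the mode).\n",
"    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')\n",
"    df[cat_features] = imp_mode.fit_transform(df[cat_features])\n",
"    # Numerical features: fill with the column mean.\n",
"    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"    df[num_features] = imp_mean.fit_transform(df[num_features])\n",
"    print(\"Total missing values after processing:\", df[features].isna().sum().sum() )\n",
"\n",
"    # One-hot encode the categorical features, dropping the first level of each.\n",
"    X = pd.get_dummies(df[features], columns=cat_features, drop_first=True)\n",
"    return y, X\n",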
"\n",
"y, X = preprocess(train, num_features, cat_features, 'Survived')\n",
"test_y, test_X = preprocess(test, num_features, cat_features, 'Survived')"
]
},
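{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `preprocess` fits its imputers on whichever DataFrame it receives, so the test set is filled with its own statistics. A common alternative, sketched here under assumptions, fits the imputer on the training data and reuses the learned values for the test data. The frames `train_df` and `test_df` are tiny hypothetical stand-ins, not the Titanic data:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Tiny hypothetical frames standing in for the train and test sets.\n",
"train_df = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Fare': [10.0, 30.0, np.nan]})\n",
"test_df = pd.DataFrame({'Age': [np.nan, 50.0], 'Fare': [5.0, np.nan]})\n",
"\n",
"# Learn the column means from the training data only...\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy='mean')\n",
"train_filled = pd.DataFrame(imputer.fit_transform(train_df), columns=train_df.columns)\n",
"\n",
"# ...and reuse those same means to fill the gaps in the test data.\n",
"test_filled = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)\n",
"\n",
"print(imputer.statistics_)\n",
"print(test_filled)\n",
"```\n",
"\n",
"Either design removes the missing values; the fit-on-train version simply applies exactly the same fill values to both sets."
]
},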
{
"cell_type": "markdown",
"metadata": {
"id": "yIEMMxHGwEXG"
},
"source": [
"# PCA Analysis\n",
"\n",
"See [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). \n",
"\n",
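"`PCA` can select components in two ways: pass an integer `n_components` to keep a fixed number of components, or pass a float between 0 and 1 to keep the fewest components needed to explain at least that fraction of the variance. "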
]
},
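{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two ways of setting `n_components` can be compared side by side. This is a minimal sketch on synthetic data; the array `data` and its column scales are hypothetical and chosen only so that a few columns dominate the variance:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Synthetic data (hypothetical, not the Titanic features): 200 rows and\n",
"# 4 columns whose variances differ by orders of magnitude.\n",
"rng = np.random.default_rng(0)\n",
"data = rng.normal(size=(200, 4)) * np.array([50.0, 10.0, 1.0, 0.5])\n",
"\n",
"# Integer: keep exactly this many components.\n",
"pca_fixed = PCA(n_components=3).fit(data)\n",
"\n",
"# Float in (0, 1): keep the fewest components whose cumulative\n",
"# explained-variance ratio reaches that fraction.\n",
"pca_ratio = PCA(n_components=0.97).fit(data)\n",
"\n",
"print(pca_fixed.n_components_)\n",
"print(pca_ratio.n_components_)\n",
"print(pca_ratio.explained_variance_ratio_.cumsum())\n",
"```\n",
"\n",
"After fitting, `n_components_` reports how many components were actually kept, which is the useful number to check when a variance threshold was requested."
]
},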
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "pRaYU2YCvyNw"
},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"pca = PCA(n_components=5)\n",
"pca.fit(X)\n",
"X2=pca.transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Q_NMT2q9v5e2",
"outputId": "ddfa29b0-f907-4fc8-ca84-a30731610608"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2.47107661e+03 1.67651481e+02 1.25165106e+00 4.73653673e-01\n",
" 3.18808533e-01]\n"
]
}
],
"source": [
"# Variance explained by each principal component, largest first\n",
"# (explained_variance_ratio_ gives the same information as fractions of the total).\n",
"print(pca.explained_variance_)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "6P5EwsE5wROv"
},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"# A float n_components keeps the fewest components that explain at least 97% of the variance.\n",
"pca2 = PCA(n_components=0.97)\n",
"pca2.fit(X)\n",
"X3=pca2.transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "o7jTEHLBzIzw",
"outputId": "b29c6774-34c8-4a44-f03f-21824a1d319a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2471.07660618 167.65148116]\n"
]
}
],
"source": [
"print(pca2.explained_variance_)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "YUBZDEqxv8s1",
"outputId": "7d115fee-2c83-4b9e-cac0-fd3e8f31e6d5"
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.00000000e+00, 1.90521771e-16],\n",
" [1.90521771e-16, 1.00000000e+00]])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
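"# Despite its name, cov_data holds the correlation matrix of the retained components;\n",
"# off-diagonal values near 0 show that the components are uncorrelated.\n",
"cov_data = np.corrcoef(X3.T)\n",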
"cov_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
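"### Elbow Plot and Kaiser's Rule Cutoff\n",
"\n",
"[Here](https://docs.displayr.com/wiki/Kaiser_Rule) is a link to documentation of Kaiser's Rule. The rule keeps only the components whose eigenvalue is greater than 1, which is the cutoff drawn as a dashed line in the scree plot. \n"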
]
},
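{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kaiser's rule can be illustrated with NumPy alone. This minimal sketch assumes a hypothetical array `data` built from two correlated pairs of columns; it is not the Titanic feature matrix and does not use `factor_analyzer`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Synthetic data (hypothetical): two pairs of strongly correlated columns.\n",
"rng = np.random.default_rng(0)\n",
"base = rng.normal(size=(300, 2))\n",
"data = np.column_stack([\n",
"    base[:, 0],\n",
"    base[:, 0] + 0.3 * rng.normal(size=300),\n",
"    base[:, 1],\n",
"    base[:, 1] + 0.3 * rng.normal(size=300),\n",
"])\n",
"\n",
"# Eigenvalues of the correlation matrix, sorted from largest to smallest.\n",
"corr = np.corrcoef(data, rowvar=False)\n",
"eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]\n",
"\n",
"# Kaiser's rule: keep components whose eigenvalue exceeds 1, i.e. components\n",
"# that explain more variance than a single standardized variable.\n",
"n_keep = int(np.sum(eigvals > 1.0))\n",
"print(eigvals.round(2), n_keep)\n",
"```\n",
"\n",
"Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 marks a component that carries more than an average variable's share of the variance."
]
},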
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data passed Bartlett’s test for sphericity.\n",
"Performing PCA using rotation: quartimax factors: 4 and standardization: False\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: Scree Plot for PCA; x-axis: Principal Component, y-axis: Eigenvalue]"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from factor_analyzer import FactorAnalyzer\n",
"from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"\n",
"def scree_plot(eigvals):\n",
"    \"\"\"Plot the eigenvalues by component together with the Kaiser's rule cutoff at 1.\"\"\"\n",
"    fig = plt.figure(figsize=(8,5))\n",
"    sing_vals = np.arange(len(eigvals)) + 1\n",
"    plt.plot(sing_vals, eigvals, 'ro-', linewidth=2)\n",
"    # Horizontal line at eigenvalue = 1 (Kaiser's rule cutoff)\n",
"    horiz_line_data = np.ones(len(sing_vals))\n",
"    plt.plot(sing_vals, horiz_line_data, 'r--')\n",
"    plt.title('Scree Plot for PCA')\n",
"    plt.xlabel('Principal Component')\n",
"    plt.ylabel('Eigenvalue')\n",
"    # I don't like the default legend, so I typically make mine like below, e.g.\n",
"    # with smaller fonts and a bit transparent so it does not cover up data,\n",
"    # placed wherever matplotlib judges best.\n",
"    leg = plt.legend(['Eigenvalues from PCA', \"Kaiser's Rule Cutoff\"], loc='best', borderpad=0.3,\n",
"                     shadow=False, prop=matplotlib.font_manager.FontProperties(size='small'),\n",
"                     markerscale=0.4)\n",
"    leg.get_frame().set_alpha(0.4)\n",
"\n",
"    #plt.savefig(os.path.join(save_dir / (name +'.jpg')))\n",
"    return plt\n",
"\n",
"def pca_workflow(X, factors=-1, standardize=False, rotation='quartimax'):\n",
"    \"\"\"\n",
"    Run Bartlett's test, draw a scree plot, and extract rotated principal components.\n",
"    If factors is -1, the number of factors is chosen with Kaiser's rule.\n",
"    \"\"\"\n",
"\n",
"    chi_square_value, p_value = calculate_bartlett_sphericity(X)\n",
"\n",
"    if p_value <= 0.05:\n",
"        print(\"Data passed Bartlett’s test for sphericity.\")\n",
"    else:\n",
"        print(\"Data failed Bartlett’s test for sphericity, use PCA with caution.\")\n",
"\n",
"    # Always compute the eigenvalues so the scree plot works even when\n",
"    # the caller supplies the number of factors.\n",
"    fa = FactorAnalyzer(n_factors=X.shape[1], rotation=None, method='ml')\n",
"    fa.fit(X)\n",
"    # Check Eigenvalues\n",
"    ev, v = fa.get_eigenvalues()\n",
"    if factors == -1:\n",
"        # Set the number of factors to the count of eigenvalues > 1.0\n",
"        factors = int(np.sum(ev > 1.0))\n",
"    print (\"Performing PCA using rotation:\", rotation, \" factors: \", factors, \"and standardization: \", standardize)\n",
"    loading_cols = ['F'+str(x+1) for x in range(factors)]\n",
"    plot = scree_plot(ev)\n",
"\n",
"    if standardize:\n",
"        X = StandardScaler().fit_transform(X)\n",
"\n",
"    fa = FactorAnalyzer(n_factors=factors, method='principal', rotation=rotation)\n",
"    fa.fit(X)\n",
"\n",
"    # Change it back to a dataframe.\n",
"    results = pd.DataFrame(fa.transform(X), columns=loading_cols)\n",
"\n",
"    return results\n",
"\n",
"X4= pca_workflow(X)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"