44. Homework-6#

44.1. Total number of points: 40#

45. Due date: Nov 10, 2022#

Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.

This homework tests your knowledge of the basics of Python. The shared Python notebooks will help you solve these problems.

Steps to evaluate your solutions:

Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)

Step-2: Launch the Anaconda console, then open Jupyter Notebook from it.

Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”

Step-4: Restart the kernel (in the menu, select Kernel → Restart) and run all cells (in the menubar, select Cell → Run All) to confirm that the notebook runs from top to bottom.

Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the file in the .ipynb format.

Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.

46. Please note that for any question in this assignment you will receive points ONLY if your solution passes all the test cases, including the hidden test cases. So please think through all possible scenarios before submitting your answers.#

  • Note that hidden tests are present to ensure you are not hardcoding.

  • If caught cheating:

    • you will receive a score of 0 for the 1st violation.

    • for repeated incidents, you will receive an automatic ‘F’ grade and will be reported to the dean of the Lally School of Management.

import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/lmanikon/lmanikon.github.io/master/teaching/datasets/KaggleAdmissions.csv')
df.columns=['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'LOR', 'CGPA', 'Research', 'Chance of Admit']
df.head()

47. Q1 [18 points]#

47.1. 1. Create a new dataframe df_sub that is a copy of df. Drop Serial No. from df_sub.#

47.2. 2. Standardize only these attributes in df_sub using the function RobustScaler():#

  • GRE Score, TOEFL Score

47.3. 3. Perform normalization only on these attributes in df_sub using the function StandardScaler()#

  • University Rating, SOP, LOR, CGPA, Research

47.4. Note that after steps 2 and 3, make sure the transformed values are saved in df_sub#
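As a rough illustration of applying different scalers to different column subsets and writing the results back into the same dataframe, here is a minimal sketch on hypothetical toy data. The dataframe `toy` and its column names are assumptions made for this example and are not part of the assignment dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "score_a": [300.0, 310.0, 320.0, 330.0, 340.0],
    "score_b": [100.0, 105.0, 110.0, 115.0, 120.0],
    "rating": [1.0, 2.0, 3.0, 4.0, 5.0],
})

robust_cols = ["score_a", "score_b"]
standard_cols = ["rating"]

# RobustScaler centers each column on its median and divides by its
# interquartile range, so it is less sensitive to outliers.
toy[robust_cols] = RobustScaler().fit_transform(toy[robust_cols])

# StandardScaler subtracts the mean and divides by the population
# standard deviation, giving each column mean 0 and std 1.
toy[standard_cols] = StandardScaler().fit_transform(toy[standard_cols])

print(toy)
```

Assigning the output of `fit_transform` back to `toy[cols]` is what keeps the transformed values in the dataframe; calling the scaler without the assignment leaves the original values untouched.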

47.5. 4. Create a new column named Admit from the original column Chance of Admit, converting its values to a discrete set of class labels using these conditions. Then drop the Chance of Admit column from df_sub.#

  • 2, if Chance of Admit value is >= 0.65

  • 1, if Chance of Admit value is < 0.65
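One common way to turn a continuous column into two class labels with a threshold is `np.where`. The sketch below uses a hypothetical dataframe `toy` with a made-up column `chance`; it only illustrates the pattern and is not the assignment solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "chance" and "label" are illustrative names.
toy = pd.DataFrame({"chance": [0.40, 0.65, 0.90, 0.64]})
threshold = 0.65

# Label 2 when the value is at or above the threshold, otherwise label 1.
toy["label"] = np.where(toy["chance"] >= threshold, 2, 1)

# Remove the continuous column once the discrete label exists.
toy = toy.drop(columns=["chance"])

print(toy)
```

Note that a value exactly equal to the threshold falls in the upper class because the comparison is `>=`.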

#Answering the above questions in the same order as listed

# YOUR CODE HERE
raise NotImplementedError()
#[6 points] Test cell-1
#DO NOT MODIFY/DELETE THIS CELL 
assert (len(df_sub))==400
assert (len(df_sub.columns))==8
assert (set(df_sub.columns))=={'LOR', 'SOP', 'CGPA', 'TOEFL Score', 'GRE Score', 'Research', 'Admit', 'University Rating'}
#[6 points] Test cell-2
#DO NOT MODIFY/DELETE THIS CELL 
assert (round(np.mean(df_sub['CGPA']), 2))==0.0
assert (round(np.std(df_sub['CGPA']), 2))==1.0
assert (round(np.mean(df_sub['TOEFL Score']), 2))==0.05
#[6 points] Test cell-3
#DO NOT MODIFY/DELETE THIS CELL 
assert (round(np.sum(df_sub['Admit']), 2))==687
assert (set(df_sub['Admit']))=={1, 2}
assert (len(df_sub['Admit'].loc[df_sub['Admit']==1]))==113

48. Q2 [15 points]#

48.1. 1. Split the data into X and y for feature columns and class label column respectively.#

  • Feature columns (X): GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research

  • Class label column (y): Admit

48.2. 2. Using X and y variables representing features and class labels, perform train_test_split operation to build training (X_train, y_train) and testing data (X_test, y_test).#

  • Use test_size=0.4, random_state=55 as the parameters for train_test_split() function.
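The split into features, labels, and train/test partitions can be sketched on hypothetical toy data as follows. The dataframe, column names, and the `random_state=0` value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two feature columns and one label column.
toy = pd.DataFrame({
    "f1": np.arange(10, dtype=float),
    "f2": np.arange(10, dtype=float) * 2,
    "label": [1, 2] * 5,
})

X = toy[["f1", "f2"]]  # feature columns
y = toy["label"]       # class label column

# test_size=0.4 holds out 40% of the rows; random_state fixes the shuffle
# so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` matters whenever results are checked against expected values, because a different shuffle produces a different train/test partition.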

48.3. 3. Train the random forest classifier (initialized as variable rf) using these parameters: max_depth=7, random_state=23.#

  • Use the trained model rf to first compute the accuracy score and assign it to the variable acc_rf.

  • Then compute the impurity-based feature importances.

  • Append the names of the three most important features to the list impFeatures. Please make sure you type the feature names exactly as they appear in df_sub.
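For reference, impurity-based importances are exposed by a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below ranks them on hypothetical synthetic data; the data, the column names, and the hyperparameters are assumptions chosen only for illustration, and the accuracy here is computed on the training rows for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical synthetic data: the label depends only on column "a".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = (X["a"] > 0).astype(int) + 1

model = RandomForestClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Accuracy on the same rows used for fitting (illustration only).
acc = accuracy_score(y, model.predict(X))

# Pair each importance with its column name, then rank in descending order.
importances = pd.Series(model.feature_importances_, index=X.columns)
top3 = importances.sort_values(ascending=False).head(3).index.tolist()

print(acc, top3)
```

Wrapping the importances in a `pd.Series` indexed by column name avoids retyping feature names by hand when collecting the top entries.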

48.4. 4. Train two K-Nearest Neighbors classifiers (initialized as variables knn1 and knn2) using n_neighbors=5 and n_neighbors=20 respectively, both with the kd_tree algorithm.#

  • Using the models knn1 and knn2 trained on the training data, compute the accuracy scores on the test data and assign them to the variables acc_knn_5 and acc_knn_22 respectively (for k=5 and k=20).
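Comparing two neighborhood sizes with the same search algorithm can be sketched as below. The synthetic two-cluster data, the values of k, and the variable names are assumptions for illustration and do not correspond to the assignment dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, size=(50, 2)),
    rng.normal(loc=2.0, size=(50, 2)),
])
y = np.array([1] * 50 + [2] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

scores = {}
for k in (3, 15):  # illustrative neighborhood sizes
    # algorithm="kd_tree" selects the KD-tree neighbor search structure.
    clf = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")
    clf.fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, clf.predict(X_te))

print(scores)
```

The `algorithm` parameter only changes how neighbors are looked up, not which neighbors are found, so any difference in accuracy between the two models comes from `n_neighbors`.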

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#Answering the above questions in the same order as listed

impFeatures=[]

# YOUR CODE HERE
raise NotImplementedError()
#[8 points] Test cell-4
#DO NOT MODIFY/DELETE THIS CELL 
assert (rf.max_depth)==7
assert (rf.n_estimators)==100
#assert round(acc_rf,2)==0.9
assert round(acc_rf,2)<=0.9
assert set(impFeatures)=={'GRE Score', 'CGPA', 'TOEFL Score'}
#[6 points] Test cell-5
#DO NOT MODIFY/DELETE THIS CELL 
assert knn1.algorithm=="kd_tree"
assert knn2.algorithm=="kd_tree"
assert (round(acc_knn_5,2))<=0.87
assert (round(acc_knn_22, 2))<=0.9

49. Q3 [10 points]#

49.1. 1. In the string variable your_response1, explain why there is a performance difference between the knn1 and knn2 models, which were trained with 5 and 20 neighbors respectively. Justify your answer.#

49.2. 2. In the string variable your_response2, describe whether the approach used in Q1 is a reasonable way to perform normalization.#

your_response1=" "
your_response2=" "

# YOUR CODE HERE
raise NotImplementedError()
#[8 points] Hidden Test cell-6
#DO NOT MODIFY/DELETE THIS CELL