44. Homework-6#
44.1. Total number of points: 40#
45. Due date: Nov 10, 2022#
Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.
This homework will test your knowledge of the basics of Python. The shared Python notebooks will be helpful for solving these problems.
Steps to evaluate your solutions:
Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)
Step-2: Open Jupyter Notebook by first launching the Anaconda software console
Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”
Step-4: Restart the kernel, then run all cells (in the menubar at the top of this window, select Cell → Run All).
Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.
Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.
47. Q1 [18 points]#
47.1. 1. Create a new dataframe df_sub that is a copy of df. Drop Serial No. from df_sub.#
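As a hint, here is a minimal sketch of copying a DataFrame and dropping a column. It uses a made-up toy DataFrame, not the homework data; the names df_toy and df_toy_sub and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for a DataFrame (hypothetical values, not the homework data).
df_toy = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 324, 316],
    "CGPA": [9.65, 8.87, 8.00],
})

# .copy() creates an independent DataFrame, so later edits do not touch df_toy.
df_toy_sub = df_toy.copy()

# drop(columns=...) returns a new DataFrame without the named column.
df_toy_sub = df_toy_sub.drop(columns=["Serial No."])

print(list(df_toy_sub.columns))  # ['GRE Score', 'CGPA']
print(list(df_toy.columns))      # ['Serial No.', 'GRE Score', 'CGPA']
```

Because the copy is independent, the column is removed only from the DataFrame on which drop is called.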
47.2. 2. Standardize only these attributes in df_sub using the function RobustScaler():#
GRE Score, TOEFL Score
47.3. 3. Perform normalization only on these attributes in df_sub using the function StandardScaler():#
University Rating, SOP, LOR, CGPA, Research
47.4. Note: after steps 2 and 3, make sure the transformed values are still saved in df_sub.#
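As a hint on how the two scalers behave, here is a minimal sketch on a made-up toy DataFrame (the column names a, b, c and all values are assumptions, not the homework data). It applies each scaler to a chosen subset of columns and writes the result back into the same DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy data (hypothetical): column "a" contains an outlier.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
    "c": [5.0, 5.0, 6.0, 7.0, 7.0],
})

robust_cols = ["a"]         # scaled with the median and interquartile range
standard_cols = ["b", "c"]  # scaled to zero mean and unit variance

# fit_transform returns a NumPy array; assigning it back keeps the
# transformed values in the original DataFrame.
toy[robust_cols] = RobustScaler().fit_transform(toy[robust_cols])
toy[standard_cols] = StandardScaler().fit_transform(toy[standard_cols])

print(toy["a"].median())          # median of a robust-scaled column is 0
print(toy["b"].to_numpy().std())  # population std of a standardized column is 1
```

Note that StandardScaler uses the population standard deviation (ddof=0), which matches the default of np.std but not the default of the pandas .std() method.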
47.5. 4. Create a new column named Admit from the original column Chance of Admit to create a discrete set of class labels using the conditions below. Then drop the Chance of Admit column from df_sub.#
Convert to:
2, if the Chance of Admit value is >= 0.65
1, if the Chance of Admit value is < 0.65
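As a hint, one possible way to turn a continuous column into two class labels is np.where. The sketch below uses a made-up toy DataFrame; the column names score and label and the values are assumptions for illustration, not the homework data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({"score": [0.92, 0.65, 0.64, 0.30]})
threshold = 0.65

# np.where(condition, value_if_true, value_if_false) works element-wise.
toy["label"] = np.where(toy["score"] >= threshold, 2, 1)

# Remove the continuous column once the labels exist.
toy = toy.drop(columns=["score"])

print(toy["label"].tolist())  # [2, 2, 1, 1]
```

The value equal to the threshold receives label 2 because the comparison is >=.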
#Answering the above questions in the same order as listed
# YOUR CODE HERE
raise NotImplementedError()
#[6 points] Test cell-1
#DO NOT MODIFY/DELETE THIS CELL
assert (len(df_sub))==400
assert (len(df_sub.columns))==8
assert (set(df_sub.columns))=={'LOR', 'SOP', 'CGPA', 'TOEFL Score', 'GRE Score', 'Research', 'Admit', 'University Rating'}
#[6 points] Test cell-2
#DO NOT MODIFY/DELETE THIS CELL
assert (round(np.mean(df_sub['CGPA']), 2))==0.0
assert (round(np.std(df_sub['CGPA']), 2))==1.0
assert (round(np.mean(df_sub['TOEFL Score']), 2))==0.05
#[6 points] Test cell-3
#DO NOT MODIFY/DELETE THIS CELL
assert (round(np.sum(df_sub['Admit']), 2))==687
assert (set(df_sub['Admit']))=={1, 2}
assert (len(df_sub['Admit'].loc[df_sub['Admit']==1]))==113
48. Q2 [14 points]#
48.1. 1. Split the data into X and y for the feature columns and the class label column, respectively.#
Feature columns (X): GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research
Class label column (y): Admit
48.2. 2. Using the X and y variables representing features and class labels, perform the train_test_split operation to build the training data (X_train, y_train) and testing data (X_test, y_test).#
Use test_size=0.4, random_state=55 as the parameters for the train_test_split() function.
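As a hint on what train_test_split returns, here is a minimal sketch on made-up arrays. The names X_toy, y_toy, X_tr, X_te, y_tr, y_te and random_state=0 are assumptions for illustration, not the values required above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, labels 1 and 2.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1, 2] * 5)

# The function returns four objects in this order:
# training features, testing features, training labels, testing labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0
)

print(X_tr.shape, X_te.shape)  # (6, 2) (4, 2)
```

With test_size=0.4, 40% of the rows go to the test split; fixing random_state makes the shuffle reproducible.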
48.3. 3. Train the random forest classifier (initialized as variable rf) using these parameters: max_depth=7, random_state=23.#
Use the trained model rf to first compute the accuracy score and assign it to the variable acc_rf. Then compute the impurity-based feature importances.
Append the names of the top-3 features to the list impFeatures. Please make sure you type the feature names exactly as in df_sub.
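As a hint, here is a minimal sketch of reading impurity-based feature importances from a fitted random forest. It uses synthetic data; the feature names f0 to f3, the labelling rule, and random_state=0 are assumptions for illustration, not the homework setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the labels depend only on f0 and f1.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
y_toy = (X_toy["f0"] + 0.5 * X_toy["f1"] > 0).astype(int) + 1  # labels 1 and 2

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)

model = RandomForestClassifier(max_depth=7, random_state=0)
model.fit(X_tr, y_tr)

# Accuracy on the held-out split.
acc = accuracy_score(y_te, model.predict(X_te))

# feature_importances_ holds one impurity-based score per feature column.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
top2 = importances.sort_values(ascending=False).head(2).index.tolist()

print(acc)
print(top2)
```

Pairing the importances with the column names in a Series makes it easy to sort them and pick the highest-ranked names.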
48.4. 4. Train two K-Nearest Neighbors classifiers (initialized as variables knn1 and knn2) using n_neighbors=5 and n_neighbors=20, respectively, and the kd_tree algorithm.#
Using the models knn1 and knn2 trained on the training data, compute the accuracy score on the test data and assign it to the variables acc_knn_5 and acc_knn_22, respectively, for k=5 and k=20. (The test cell checks the variable name acc_knn_22, so keep that name for the k=20 model.)
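As a hint, here is a minimal sketch of fitting K-Nearest Neighbors classifiers with different numbers of neighbors and comparing their test accuracy. It uses synthetic data; the labelling rule, the scores dictionary, and random_state=0 are assumptions for illustration, not the homework setup.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical): 200 samples, 3 features, labels 1 and 2.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int) + 1

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)

scores = {}
for k in (5, 20):
    # algorithm="kd_tree" selects the KD-tree neighbor search.
    clf = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")
    clf.fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, clf.predict(X_te))

print(scores)
```

A small k follows local structure closely, while a larger k averages over more neighbors and gives smoother decision boundaries.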
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
#Answering the above questions in the same order as listed
impFeatures=[]
# YOUR CODE HERE
raise NotImplementedError()
#[8 points] Test cell-4
#DO NOT MODIFY/DELETE THIS CELL
assert (rf.max_depth)==7
assert (rf.n_estimators)==100
#assert round(acc_rf,2)==0.9
assert round(acc_rf,2)<=0.9
assert set(impFeatures)=={'GRE Score', 'CGPA', 'TOEFL Score'}
#[6 points] Test cell-5
#DO NOT MODIFY/DELETE THIS CELL
assert knn1.algorithm=="kd_tree"
assert knn1.algorithm=="kd_tree"
assert (round(acc_knn_5,2))<=0.87
assert (round(acc_knn_22, 2))<=0.9
49. Q3 [8 points]#
49.1. 1. Assign your response to the string variable your_response1, explaining why there is a performance difference between the knn1 and knn2 models that were trained with 5 and 20 neighbors, respectively. Justify.#
49.2. 2. Include your response in the variable your_response2, describing whether this is a reasonable way to perform normalization.#
your_response1=" "
your_response2=" "
# YOUR CODE HERE
raise NotImplementedError()
#[8 points] Hidden Test cell-6
#DO NOT MODIFY/DELETE THIS CELL