64. Exam-3: Fall 2022
65. Total number of points: 51
This is an open-book test. However, talking or chatting with your classmates is NOT permitted and will be considered plagiarism.
Before you submit this test, make sure your answers are saved, and everything runs as expected.
For empty notebook submissions, you will receive 0 points.
If you do not follow the instructions (including using the exact variable names specified) and this leads to a wrong answer, you will receive 0 points.
First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). Steps to evaluate your solutions:
Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/).
Step-2: Open the Jupyter Notebook by first launching the anaconda software console
Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”
Step-4: Restart the kernel (in the menubar, select Kernel → Restart) and run all cells (in the menubar, select Cell → Run All).
Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.
Step-6: Go to lms.rpi.edu and upload your notebook at the appropriate link to submit this homework. Make sure your answers are saved before you submit.
67. Q-1 [15 points] Please answer the following questions in the order listed below:
1. Select all features that start with `w` and assign them to the dataframe `df_w`.
2. Select all features that start with `cad` and assign them to the dataframe `df_cad`.
3. Split `df` (using all columns) into one dataframe (`df_1`) that includes the first 7000 rows and a second one (`df_2`) that includes the remaining 3000 rows. Reset the index for `df_2`, dropping the old index in the process (i.e., it should not be added to the dataframe as a column).
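As a reminder of the pandas mechanics involved, here is a minimal sketch on a toy dataframe. The frame `toy`, its column names, and the 7/3 split are made up for illustration and are not the exam data; this is one possible approach, not the graded solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (10 rows, 4 columns); not the exam dataset.
toy = pd.DataFrame(
    np.arange(40).reshape(10, 4),
    columns=["w0", "w1", "cad0", "other"],
)

# Select columns by name prefix using a boolean mask over the column index.
toy_w = toy.loc[:, toy.columns.str.startswith("w")]
toy_cad = toy.loc[:, toy.columns.str.startswith("cad")]

# Split rows by position: the first 7 rows, then the rest.
toy_1 = toy.iloc[:7]
# drop=True discards the old index instead of adding it as a column.
toy_2 = toy.iloc[7:].reset_index(drop=True)
```

`DataFrame.filter(regex="^w")` is an alternative way to select columns by prefix.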
# YOUR CODE HERE
raise NotImplementedError()
#This will test that you have not mislabeled a variable.
assert 'df_w' in globals()
assert 'df_cad' in globals()
assert 'df_1' in globals()
assert 'df_2' in globals()
#Test cell-1 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert df_1.shape[0]==7000
#Test cell-2 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert df_2.shape[0]==3000
#Test cell-3 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len (set(df_cad.columns).difference({'cad0','cad1','cad2','cad3','cad4','cad5','cad6','cad7','cad8','cad9'}))==0
#Test cell-4 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len (set(df_w.columns).difference({'w0','w1','w2','w3','w4','w5','w6','w7','w8','w9','w10','w11','w12','w13','w14','w15','w16','w17','w18','w19','w20','w21','w22','w23','w24','w25','w26','w27','w28','w29'}))==0
#Test cell-5 [1pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert df_1.shape[1]==42
assert df_2.shape[1]==42
assert df_2.index[0]==0
68. Q-2 [12 points] Please answer the following questions in the order listed below:
Use the values `X_train`, `X_test`, `y_train`, `y_test` provided below to answer the following questions.
1. Use a support vector classifier with `random_state=0` to train a machine learning model.
2. Compute the accuracy for the train and the test sets and assign them to the variables `acc_train` and `acc_test`, respectively.
3. Compute the `roc_auc_score` for the train and the test sets and assign them to the variables `roc_train` and `roc_test`, respectively.
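The general scikit-learn pattern for this kind of task can be sketched on synthetic data as follows. The data from `make_classification` and all variable names here are hypothetical. Scoring ROC AUC from `decision_function` outputs is an assumption of this sketch; the exam does not state which scores the hidden tests expect.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data; not the exam dataset.
Xs, ys = make_classification(n_samples=500, n_features=10, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.7, random_state=0
)

# Fit a support vector classifier with default hyperparameters.
clf = SVC(random_state=0).fit(Xs_train, ys_train)

# Accuracy compares hard class predictions with the true labels.
demo_acc_train = accuracy_score(ys_train, clf.predict(Xs_train))
demo_acc_test = accuracy_score(ys_test, clf.predict(Xs_test))

# Assumption: ROC AUC is computed from continuous decision scores,
# which rank samples rather than assigning hard labels.
demo_roc_train = roc_auc_score(ys_train, clf.decision_function(Xs_train))
demo_roc_test = roc_auc_score(ys_test, clf.decision_function(Xs_test))
```

Passing hard predictions from `predict` to `roc_auc_score` also runs, but it generally gives a different value than continuous scores do.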
#This loads our data
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/data2.csv")
X = df.iloc[:,2:]
y = df['treatment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
#Your solution here. Make sure the cell executes without errors.
# YOUR CODE HERE
raise NotImplementedError()
#This will test that you have not mislabeled a variable.
assert 'acc_train' in globals()
assert 'acc_test' in globals()
assert 'roc_train' in globals()
assert 'roc_test' in globals()
#Test cell-6 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(acc_train,2)==0.87
#Test cell-7 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(acc_test,2)==0.77
#Test cell-8 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(roc_train,2)==0.94
#Test cell-9 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(roc_test,2)==0.85
68.1. Q-3 [24 points] Please answer the following questions in the order listed below:
Use the values `X_train`, `X_test`, `y_train`, `y_test` provided below to answer the following questions.
1. Use a `GradientBoostingRegressor()` to train a machine learning model to predict the continuous variable `y` (which was named `target` in the original dataframe).
2. Compute the R2 for the train and the test sets and assign them to the variables `r2_train` and `r2_test`, respectively.
3. Compute the `mean_absolute_percentage_error` for the train and the test sets and assign them to the variables `mape_train` and `mape_test`, respectively.
4. Create a very simple model which uses the mean value from the training data as the prediction for both the training data (`y_train_pred_s`) and the test data (`y_test_pred_s`), where both should be Pandas Series of the appropriate length.
5. Evaluate the R2 of `y_train_pred_s` (generating `r2_train_s`) and `y_test_pred_s` (generating `r2_test_s`).
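The regression workflow and the mean baseline can be sketched on synthetic data as follows. The data from `make_regression`, the shift of the target by 500, the `random_state=0` on the regressor, and all variable names are assumptions made for a reproducible illustration; they do not come from the exam.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data; the target is shifted away from zero
# so that percentage errors stay well-behaved in this toy example.
Xr, yr = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
yr = pd.Series(yr + 500.0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.7, random_state=0
)

# random_state is set here only to make the sketch reproducible.
gbr = GradientBoostingRegressor(random_state=0).fit(Xr_train, yr_train)
pred_train = gbr.predict(Xr_train)
pred_test = gbr.predict(Xr_test)

demo_r2_train = r2_score(yr_train, pred_train)
demo_r2_test = r2_score(yr_test, pred_test)
demo_mape_train = mean_absolute_percentage_error(yr_train, pred_train)
demo_mape_test = mean_absolute_percentage_error(yr_test, pred_test)

# Baseline: predict the training mean for every row of both sets.
train_mean = yr_train.mean()
base_train = pd.Series(train_mean, index=yr_train.index)
base_test = pd.Series(train_mean, index=yr_test.index)
demo_r2_train_s = r2_score(yr_train, base_train)
demo_r2_test_s = r2_score(yr_test, base_test)
```

A constant prediction equal to the training mean gives an R2 of zero on the training set (up to floating-point error) and an R2 at or below zero on any other set. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so 1.0 corresponds to 100%.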
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/data2.csv")
X = df.iloc[:,2:]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
#Your solution here. Make sure the cell executes without errors.
# YOUR CODE HERE
raise NotImplementedError()
#This will test that you have not mislabeled a variable.
assert 'r2_train' in globals()
assert 'r2_test' in globals()
assert 'mape_train' in globals()
assert 'mape_test' in globals()
assert 'y_train_pred_s' in globals()
assert 'y_test_pred_s' in globals()
assert 'r2_train_s' in globals()
assert 'r2_test_s' in globals()
#Test cell-10 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(r2_train,1)==0.6
#Test cell-11 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(r2_test,1)==0.4
#Test cell-12 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(mape_train,0)==90
#Test cell-13 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(mape_test,0)==114
#Test cell-14 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(r2_train_s,1)==0.0
#Test cell-15 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(r2_test_s,3)==-0.001
#Test cell-16 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(y_train_pred_s)==3000
#Test cell-17 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(y_test_pred_s)==7000