Exam-3: Fall 2022

64. Exam-3: Fall 2022

65. Total number of points: 51

This is an open-book test. However, talking or chatting with your classmates is NOT permitted and will be considered plagiarism.

Before you submit this test, make sure your answers are saved, and everything runs as expected.

For empty notebook submissions, you will receive 0 points.
If you do not follow the instructions (for example, by not using the specified variable names) and this leads to a wrong answer, you will receive 0 points.

First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). Steps to evaluate your solutions:

Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/).

Step-2: Open the Jupyter Notebook by first launching the anaconda software console

Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”

Step-4: Restart the kernel and run all cells (in the menubar near the top of this window, select Cell → Run All).

Step-5: Go to “File”, click “Download as”, and then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.

Step-6: Go to lms.rpi.edu and upload your notebook at the appropriate link to submit this test. Make sure your answers are saved before you submit.

66. Please note that for any question in this test you will receive points ONLY if your solution passes all the hidden test cases. So please make sure you think through all possible scenarios before submitting your answers.

Note that hidden tests are present to ensure you are not hardcoding.
If caught cheating:
- you will receive a score of 0 for the 1st violation.
- for repeated incidents, you will receive an automatic ‘F’ grade and will be reported to the dean of Lally School of Management.

#Please do not modify/delete this cell
import pandas as pd
pd.set_option('display.max_columns', None)
df  = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/data2.csv")
print(df.shape)
df

67. Q-1 [15 points] Please answer the following questions in the order listed below:

Select all features that start with a w and assign them to the dataframe df_w.
Select all features that start with cad and assign them to the dataframe df_cad.
Split df (using all columns) into one dataframe (df_1) that includes the first 7000 rows and a second one (df_2) that includes the remaining 3000 rows. Reset the index for df_2, dropping the old index in the process (i.e., it should not be added to the dataframe as a column).
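The operations named above (selecting columns by name prefix, splitting by row position, and resetting an index) can be sketched on a small toy frame. This is only an illustration under stated assumptions: the frame `toy` and its column names are hypothetical and merely mimic the naming pattern of the exam data, and the variable names here are deliberately different from the ones the tests look for.

```python
import pandas as pd

# Hypothetical toy frame; column names only mimic the exam data's pattern.
toy = pd.DataFrame({
    "id": range(10),
    "w0": range(10, 20),
    "w1": range(20, 30),
    "cad0": range(30, 40),
})

# Two common ways to keep the columns whose names begin with a prefix:
# a boolean mask over the column index, or a regex anchored at the start.
toy_w = toy.loc[:, toy.columns.str.startswith("w")]
toy_cad = toy.filter(regex=r"^cad")

# Positional split: the first 7 rows, then everything after them.
# reset_index(drop=True) renumbers from 0 and discards the old index
# instead of inserting it as a new column.
toy_1 = toy.iloc[:7]
toy_2 = toy.iloc[7:].reset_index(drop=True)

print(list(toy_w.columns), list(toy_cad.columns))
print(toy_1.shape, toy_2.shape, toy_2.index[0])
```

Without `drop=True`, `reset_index` would add the old index as a column named `index`, which changes the number of columns.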

# YOUR CODE HERE
raise NotImplementedError()

#This will test that you have not mislabeled a variable. 
assert 'df_w' in globals()
assert 'df_cad' in globals()
assert 'df_1' in globals()
assert 'df_2' in globals()

#Test cell-1 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert df_1.shape[0]==7000

#Test cell-2 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert df_2.shape[0]==3000

#Test cell-3 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert len (set(df_cad.columns).difference({'cad0','cad1','cad2','cad3','cad4','cad5','cad6','cad7','cad8','cad9'}))==0

#Test cell-4 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert len (set(df_w.columns).difference({'w0','w1','w2','w3','w4','w5','w6','w7','w8','w9','w10','w11','w12','w13','w14','w15','w16','w17','w18','w19','w20','w21','w22','w23','w24','w25','w26','w27','w28','w29'}))==0

#Test cell-5 [1pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert df_1.shape[1]==42
assert df_2.shape[1]==42
assert df_2.index[0]==0

68. Q-2 [12 points] Please answer the following questions in the order listed below:

Use the values X_train, X_test, y_train, y_test provided below to answer the following questions.

Use a support vector classifier with random_state=0 to train a machine learning model.
Compute the accuracy for the train and the test sets and assign the results to the variables acc_train and acc_test, respectively.
Compute the roc_auc_score for the train and the test sets and assign the results to the variables roc_train and roc_test, respectively.
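As a rough illustration of the workflow described above, the sketch below fits a support vector classifier on synthetic data and computes both metrics. The data from `make_classification` and all `demo_`-prefixed names are hypothetical. Using `decision_function` scores as the input to `roc_auc_score` is an assumption made here; the test does not state whether continuous scores or hard predictions are expected.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (hypothetical stand-in for the exam data).
Xs, ys = make_classification(n_samples=300, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.7, random_state=0
)

# Fit the classifier on the training split only.
clf = SVC(random_state=0).fit(Xs_train, ys_train)

# Accuracy compares hard class predictions with the true labels.
demo_acc_train = accuracy_score(ys_train, clf.predict(Xs_train))
demo_acc_test = accuracy_score(ys_test, clf.predict(Xs_test))

# ROC AUC ranks samples by a continuous score. Assumption: the signed
# distance to the decision boundary is used as that score.
demo_roc_train = roc_auc_score(ys_train, clf.decision_function(Xs_train))
demo_roc_test = roc_auc_score(ys_test, clf.decision_function(Xs_test))

print(demo_acc_train, demo_acc_test, demo_roc_train, demo_roc_test)
```

Passing hard predictions from `predict` to `roc_auc_score` also runs, but it reduces the ROC curve to a single threshold and generally gives a different value.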

#This loads our data
from sklearn.model_selection import train_test_split
df  = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/data2.csv")
X = df.iloc[:,2:]
y = df['treatment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)

#Your solution here. Make sure the cell executes without errors. 

# YOUR CODE HERE
raise NotImplementedError()

#This will test that you have not mislabeled a variable. 
assert 'acc_train' in globals()
assert 'acc_test' in globals()
assert 'roc_train' in globals()
assert 'roc_test' in globals()

#Test cell-6 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 

assert round(acc_train,2)==0.87

#Test cell-7 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 

assert round(acc_test,2)==0.77

#Test cell-8 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 

assert round(roc_train,2)==0.94

#Test cell-9 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 

assert round(roc_test,2)==0.85

68.1. Q-3 [24 points] Please answer the following questions in the order listed below:

Use the values X_train, X_test, y_train, y_test provided below to answer the following questions.

Use a GradientBoostingRegressor() to train a machine learning model to predict the continuous variable y (which was named target in the original dataframe).
Compute the r2 for the train and the test sets and assign the results to the variables r2_train and r2_test, respectively.
Compute the mean_absolute_percentage_error for the train and the test sets and assign the results to the variables mape_train and mape_test, respectively.
Create a very simple model that uses the mean value of the training data as the prediction for both the training data (y_train_pred_s) and the test data (y_test_pred_s). Both should be Pandas Series of the appropriate length.
Evaluate the R2 of y_train_pred_s (generating r2_train_s) and y_test_pred_s (generating r2_test_s).
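The steps above can be sketched on synthetic data as follows. Everything here is illustrative: the data from `make_regression` and the `demo_`/`base_` names are hypothetical, and `random_state=0` on the regressor is an addition for reproducibility that the question itself does not ask for. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (1.0 means 100%), not a value already multiplied by 100.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical stand-in for the exam data).
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
yr = pd.Series(yr)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.7, random_state=0
)

# Fit on the training split, then score both splits.
reg = GradientBoostingRegressor(random_state=0).fit(Xr_train, yr_train)
demo_r2_train = r2_score(yr_train, reg.predict(Xr_train))
demo_r2_test = r2_score(yr_test, reg.predict(Xr_test))
demo_mape_train = mean_absolute_percentage_error(yr_train, reg.predict(Xr_train))
demo_mape_test = mean_absolute_percentage_error(yr_test, reg.predict(Xr_test))

# Mean baseline: one constant (the TRAINING mean) repeated for every row.
train_mean = yr_train.mean()
base_train = pd.Series(train_mean, index=yr_train.index)
base_test = pd.Series(train_mean, index=yr_test.index)

demo_r2_train_s = r2_score(yr_train, base_train)
demo_r2_test_s = r2_score(yr_test, base_test)

print(demo_r2_train, demo_r2_test, demo_mape_train, demo_mape_test)
print(demo_r2_train_s, demo_r2_test_s)
```

A constant prediction equal to the training mean gives an R2 of essentially zero on the training split by definition, and a value at or slightly below zero on the test split, which is why it is a useful reference point for the fitted model.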

from sklearn.model_selection import train_test_split
df  = pd.read_csv("https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/data2.csv")
X = df.iloc[:,2:]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)

#Your solution here. Make sure the cell executes without errors. 

# YOUR CODE HERE
raise NotImplementedError()

#This will test that you have not mislabeled a variable. 
assert 'r2_train' in globals()
assert 'r2_test' in globals()
assert 'mape_train' in globals()
assert 'mape_test' in globals()
assert 'y_train_pred_s' in globals()
assert 'y_test_pred_s' in globals()
assert 'r2_train_s' in globals()
assert 'r2_test_s' in globals()

#Test cell-10 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(r2_train,1)==0.6

#Test cell-11 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(r2_test,1)==0.4

#Test cell-12 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(mape_train,0)==90

#Test cell-13 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 

assert round(mape_test,0)==114

#Test cell-14 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(r2_train_s,1)==0.0

#Test cell-15 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(r2_test_s,3)==-0.001

#Test cell-16 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert len(y_train_pred_s)==3000

#Test cell-17 [3pt] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert len(y_test_pred_s)==7000