27. Exam-1: Fall 2022

28. Total number of points: 40

This is an open-book test. However, talking or chatting with your classmates is NOT permitted and will be considered plagiarism.

Before you submit this test, make sure your answers are saved, and everything runs as expected.

  • For empty notebook submissions, you will receive 0 points.

  • If you do not follow the instructions (including using the exact variable names given) and this leads to a wrong answer, you will receive 0 points.

First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All).

Steps to evaluate your solutions:

Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/).

Step-2: Launch the Anaconda console, then open Jupyter Notebook from it.

Step-3: Open the .ipynb file and write your solutions at the locations marked “# YOUR CODE HERE”.

Step-4: Restart the kernel and run all cells (in the menubar, select Cell → Run All) to confirm that everything runs as expected.

Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.

Step-6: Go to lms.rpi.edu and upload your notebook at the appropriate link to submit this test. Make sure your answers are saved before you submit.

29. Please note that for any question in this test you will receive points ONLY if your solution passes all the hidden test cases. So please make sure you think through all possible scenarios before submitting your answers.

  • Note that hidden tests are present to ensure you are not hardcoding.

  • If caught cheating:

    • you will receive a score of 0 for the 1st violation.

    • for repeated incidents, you will receive an automatic ‘F’ grade and will be reported to the dean of Lally School of Management.

30. For this test, we will be using stroke data provided by WHO. The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

  • id: unique identifier
  • gender: "Male", "Female" or "Other"
  • age: age of the patient
  • hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
  • heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
  • ever_married: "No" or "Yes"
  • work_type: 'Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'
  • Residence_type: "Rural" or "Urban"
  • avg_glucose_level: average glucose level in blood
  • bmi: body mass index
  • smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
  • stroke: 1 if the patient had a stroke or 0 if not -- Class label column
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
#Please run this cell before answering the following questions 
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/lmanikon/lmanikon.github.io/master/teaching/datasets/KaggleStroke.csv')
df.head()

31. Q-1 [10 points] Please answer the following questions in the same order:

  1. Assign to the variables rows and cols the total number of data points and the number of attributes, respectively, present in the dataframe df.

  2. Assign to the list colnames the names of the features present in the dataframe df.

  3. Find the value of age associated with the sample whose id is 1665 and assign this value to variable sid1665_age.

  4. Find the total number of missing values in ‘smoking_status’ column and assign it to variable sstat_mvals.

  5. Remove the column id from the dataframe df. Make sure this is reflected in the dataframe.
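The pandas operations behind these steps can be sketched on a small made-up frame. The frame `toy`, its values, and the result names (`n_rows`, `n_cols`, `names`, `age_of_20`, `n_missing`) are all hypothetical and chosen only for illustration; this is a sketch of the general technique, not a model answer for the exam data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the exam dataset.
toy = pd.DataFrame({
    "id": [10, 20, 30],
    "age": [34.0, 51.0, 67.0],
    "smoking_status": ["smokes", np.nan, "never smoked"],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = toy.shape

# Column labels as a plain Python list.
names = toy.columns.tolist()

# Boolean mask on one column, then read another column from the matching row.
age_of_20 = toy.loc[toy["id"] == 20, "age"].iloc[0]

# isnull() marks missing entries; summing the boolean mask counts them.
n_missing = toy["smoking_status"].isnull().sum()

# drop() returns a new frame, so assign the result back to keep the change.
toy = toy.drop(columns=["id"])
```

Note that `isnull()` only flags truly missing entries (`NaN` or `None`); a placeholder string such as "Unknown" is an ordinary value and is not counted.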

#Please make sure you use the exact names for variables including the casing of letters

# YOUR CODE HERE
raise NotImplementedError()
#This will test that you have not mislabeled a variable. 
assert 'rows' in globals()
assert 'cols' in globals()
assert 'colnames' in globals()
assert 'sid1665_age' in globals()
assert 'sstat_mvals' in globals()
#Test cell-1 [1pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert cols==12
#Test cell-1 [1pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert rows==5110
#Test cell-2 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert set(colnames)=={'work_type', 'ever_married', 'avg_glucose_level', 'smoking_status', 'stroke', 'gender', 'hypertension', 'bmi', 'Residence_type', 'id', 'heart_disease', 'age'}
#Test cell-3 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert sid1665_age==79.0
#Test cell-4 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert sstat_mvals==0
#Test cell-5 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert len(df.columns)==11

32. Q-2 [15 points] Please answer the following questions in the same order:

  1. Perform one-hot encoding on the dataframe df using the pandas get_dummies() function. Save the transformed data frame as newdf. For each categorical column with N categories, generate N-1 dummy variables. You can pass the whole data frame to get_dummies() and let it decide which columns are categorical.

  2. Replace all the missing values in this transformed data frame newdf with the average value of the corresponding column.

  3. Split the data frame newdf into a data frame X that includes all the feature columns, and a series y that includes ONLY the class label column, which is stroke. Please follow the letter casing of the variable names used here; otherwise, your code won't pass the test cases.

More details about the get_dummies() function can be found in the pandas documentation.
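The encode, impute, and split sequence described above can be sketched on a made-up frame. The frame `toy` and the names `encoded`, `features`, and `labels` are hypothetical; the sketch assumes `drop_first=True` is the intended way to obtain N-1 dummy columns, and that the column mean is the "average value" meant in step 2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the exam dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "stroke": [0, 1, 0, 0],
})

# Passing the whole frame lets get_dummies pick the string-valued columns.
# drop_first=True keeps N-1 dummy columns for a column with N categories.
encoded = pd.get_dummies(toy, drop_first=True)

# Fill each column's missing entries with that column's own mean.
encoded = encoded.fillna(encoded.mean())

# Feature matrix: every column except the label; label vector: the label only.
features = encoded.drop(columns=["stroke"])
labels = encoded["stroke"]
```

Here "gender" has two categories, so only one dummy column (`gender_Male`) remains after encoding, and the missing `bmi` entry is replaced by the mean of the three observed values.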

#Please make sure you use the exact names for variables including the casing of letters

# YOUR CODE HERE
raise NotImplementedError()
#Variable Checks
assert 'newdf' in globals()
assert 'X' in globals()
assert 'y' in globals()
newdf
#Test cell-6 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert len(newdf.columns) in [17, 18]
#Test cell-6 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert (newdf.isnull().values.sum())==0
#Test cell-7 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert round(newdf['gender_Male'].mean(), 2)==0.41
#Test cell-7 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert newdf.isnull().values.any()==False
assert round(newdf['bmi'].mean(), 2)==28.89
#Test cell-8 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert len(X.columns) in [16, 17]
assert round(y.mean(), 2)==0.05

33. Q-3 [15 points] Please answer the following questions in the same order:

  1. Create a list of values (y2_pred) in which every value is 1 and whose length equals that of the class label y2.

  2. Compute the accuracy using the ground truth labels y2 and the synthetically created predicted labels y2_pred, and save it as the variable accu.

  3. Compute the precision and recall values using y2 and y2_pred and assign these values to the variables prec and recall, respectively.

  4. Compute the F-score and assign it to variable fscore.

  5. Transform X2 to X2_st using the MinMaxScaler. X2_st should be a numpy array.
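The metrics in these steps can be worked through by hand on a few made-up labels, which also shows why an "always predict 1" baseline has perfect recall but poor precision on imbalanced data. Everything here (`truth`, `pred`, `feats`, and the result names) is hypothetical, and the sketch assumes that "F-score" means the F1 score, the harmonic mean of precision and recall. The last lines show scikit-learn's `MinMaxScaler` with its default (0, 1) range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical labels: 2 positives out of 8 samples.
truth = np.array([0, 1, 0, 0, 1, 0, 0, 0])
pred = np.ones_like(truth)  # predict the positive class for every sample

# Confusion-matrix counts for the positive class.
tp = np.sum((pred == 1) & (truth == 1))
fp = np.sum((pred == 1) & (truth == 0))
fn = np.sum((pred == 0) & (truth == 1))

accuracy = np.mean(pred == truth)             # fraction of correct predictions
precision = tp / (tp + fp)                    # correct positives / predicted positives
rec = tp / (tp + fn)                          # correct positives / actual positives
f1 = 2 * precision * rec / (precision + rec)  # harmonic mean (assumed F1)

# Min-max scaling maps each column to [0, 1] via (x - min) / (max - min).
feats = pd.DataFrame({"age": [20.0, 40.0, 60.0], "bmi": [18.0, 24.0, 30.0]})
scaled = MinMaxScaler().fit_transform(feats)  # a NumPy array by default
```

With every prediction equal to 1 there are no false negatives, so recall is 1, while accuracy and precision both equal the share of positives in the labels.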

#Load data for this question. Don't change this cell.
X2=pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/X2.csv')
y2=X2['stroke']
X2=X2.iloc[:,:4]
#Please make sure you use the exact names for variables including the casing of letters


# YOUR CODE HERE
raise NotImplementedError()
#Variable Checks

assert 'y2_pred' in globals()
assert 'accu' in globals()
assert 'prec' in globals()
assert 'recall' in globals()
assert 'fscore' in globals()
assert 'X2_st' in globals()
#Test cell-10 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert len(y2_pred)==5110
assert set(y2_pred)=={1}
#Test cell-11 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert round(accu, 2)==0.05
#Test cell-12 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(prec, 2)==0.05
#Test cell-12 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert round(recall, 2)==1.0
#Test cell-13 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert round(fscore, 2)==0.09
#Test cell-14 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 


assert round(X2_st.max(), 1)==1.0
#Test cell-15 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL 



assert type(X2_st)==np.ndarray