27. Exam-1: Fall 2022
28. Total number of points: 40
This is an open-book test. However, talking or chatting with your classmates is NOT permitted and will be considered plagiarism.
Before you submit this test, make sure your answers are saved, and everything runs as expected.
For empty notebook submissions, you will receive 0 points.
If you do not follow the instructions (including using the exact variable names given) and this leads to a wrong answer, you will receive 0 points.
First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). Steps to evaluate your solutions:
Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)
Step-2: Open the Jupyter Notebook by first launching the Anaconda software console
Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”
Step-4: Restart the kernel and run all cells (in the menubar at the top of this window, select Cell → Run All).
Step-5: Now go to “File”, then click on “Download as”, then click on “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.
Step-6: Go to lms.rpi.edu and upload your notebook at the appropriate link to submit this homework. Make sure your answers are saved before you submit.
30. For this test, we will be using Stroke data provided by WHO. The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status
- id: unique identifier
- gender: "Male", "Female" or "Other"
- age: age of the patient
- hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- ever_married: "No" or "Yes"
- work_type: 'Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- stroke: 1 if the patient had a stroke or 0 if not -- Class label column
#Please run this cell before answering the following questions
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lmanikon/lmanikon.github.io/master/teaching/datasets/KaggleStroke.csv')
df.head()
31. Q-1 [10 points] Please answer the following questions in the same order:
- Assign variables rows and cols with the total number of data points and the number of attributes present in the dataframe df.
- Assign the list colnames with the names of features present in this dataframe df.
- Find the value of age associated with the sample whose id is 1665 and assign this value to variable sid1665_age.
- Find the total number of missing values in the ‘smoking_status’ column and assign it to variable sstat_mvals.
- Remove the column id from the dataframe df. Make sure this is reflected in the dataframe.
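The pandas operations these tasks rely on can be sketched on a small toy dataframe. This is only an illustration: the dataframe toy, its values, and the variable names below are hypothetical and are not the names or answers expected by the tests.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "age": [34.0, 61.0, 47.0, 29.0],
    "smoking_status": ["smokes", "Unknown", np.nan, "never smoked"],
    "stroke": [0, 1, 0, 0],
})

# shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = toy.shape

# Column labels as a plain Python list.
names = toy.columns.tolist()

# A boolean mask selects the matching row; .iloc[0] extracts the scalar.
age_102 = toy.loc[toy["id"] == 102, "age"].iloc[0]

# isnull() flags only real missing values (NaN/None),
# not placeholder strings such as "Unknown".
n_missing = toy["smoking_status"].isnull().sum()

# drop() returns a new frame, so reassign the result to keep the change.
toy = toy.drop(columns=["id"])
```

Note that drop() does not modify the dataframe in place unless the result is reassigned, which is what "make sure this is reflected in the dataframe" guards against.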
#Please make sure you use the exact names for variables including the casing of letters
# YOUR CODE HERE
raise NotImplementedError()
#This will test that you have not mislabeled a variable.
assert 'rows' in globals()
assert 'cols' in globals()
assert 'colnames' in globals()
assert 'sid1665_age' in globals()
assert 'sstat_mvals' in globals()
#Test cell-1 [1pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert cols==12
#Test cell-1 [1pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert rows==5110
#Test cell-2 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert set(colnames)=={'work_type', 'ever_married', 'avg_glucose_level', 'smoking_status', 'stroke', 'gender', 'hypertension', 'bmi', 'Residence_type', 'id', 'heart_disease', 'age'}
#Test cell-3 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert sid1665_age==79.0
#Test cell-4 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert sstat_mvals==0
#Test cell-5 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(df.columns)==11
32. Q-2 [15 points] Please answer the following questions in the same order:
- Perform one-hot encoding on the dataframe df using Pandas get_dummies(). Save this newly transformed data frame as newdf. For each of the N categories, generate N-1 dummy variables. You can pass all values to the get_dummies() function and let it decide which variables are categorical.
- Replace all the missing values in this transformed data frame newdf with the average value of the corresponding column.
- Split the data frame newdf into a data frame X that includes all the feature columns, and a series y that includes ONLY the class label column, which is stroke.
Please follow the case format of the variables used here; otherwise, your code won't pass the test cases.
More details about the get_dummies() function can be found here: Link
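As a rough illustration of the three steps described above (encode, impute, split), here is a minimal sketch on a hypothetical three-column dataframe. The names toy, encoded, features, and labels are placeholders chosen for this example, and the values are made up; it assumes that drop_first=True is the intended way to obtain N-1 dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other"],
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "stroke": [0, 1, 0, 0],
})

# With no columns argument, get_dummies encodes the string/categorical
# columns and leaves numeric columns untouched. drop_first=True keeps
# N-1 dummy columns for a feature with N categories.
encoded = pd.get_dummies(toy, drop_first=True)

# Replace each NaN with the mean of its own column.
encoded = encoded.fillna(encoded.mean())

# Features are every column except the label; the label is a Series.
features = encoded.drop(columns=["stroke"])
labels = encoded["stroke"]
```

Here gender has three categories, so two dummy columns (gender_Male and gender_Other) are produced, and the missing bmi value is filled with the mean of the three observed values.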
#Please make sure you use the exact names for variables including the casing of letters
# YOUR CODE HERE
raise NotImplementedError()
#Variable Checks
assert 'newdf' in globals()
assert 'X' in globals()
assert 'y' in globals()
newdf
#Test cell-6 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(newdf.columns)in[17,18]
#Test cell-6 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert (newdf.isnull().values.sum())==0
#Test cell-7 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(newdf['gender_Male'].mean(), 2)==0.41
#Test cell-7 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert newdf.isnull().values.any()==False
assert round(newdf['bmi'].mean(), 2)==28.89
#Test cell-8 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(X.columns)in[16,17]
assert round(y.mean(), 2)==0.05
33. Q-3 [15 points] Please answer the following questions in the same order:
- Create a list of values (y2_pred) with all values as 1 and the same length as the class label y2.
- Compute accuracy using the ground truth labels y2 and the synthetically created predicted labels y2_pred and save it as a variable accu.
- Compute the precision and recall values using y2 and y2_pred and assign these values to variables prec and recall, respectively.
- Compute the F-score and assign it to variable fscore.
- Transform X2 to X2_st using the MinMaxScaler. X2_st should be a numpy array.
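The metrics and scaling steps listed above could look roughly like the following on toy data. This sketch assumes scikit-learn is available and that "F-score" means the F1 score (the harmonic mean of precision and recall); the names truth, always_one, feats, and scaled, and all values, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical labels: 2 positives out of 8 samples.
truth = pd.Series([0, 0, 0, 1, 0, 0, 0, 1])

# A "classifier" that predicts the positive class for every sample.
always_one = [1] * len(truth)

acc = accuracy_score(truth, always_one)         # correct / total = 2/8
precision = precision_score(truth, always_one)  # TP / (TP + FP) = 2/8
rec = recall_score(truth, always_one)           # TP / (TP + FN) = 2/2
f1 = f1_score(truth, always_one)                # 2 * P * R / (P + R)

# Hypothetical features; MinMaxScaler rescales each column to [0, 1].
feats = pd.DataFrame({"age": [20.0, 40.0, 60.0], "bmi": [18.0, 24.0, 30.0]})

# fit_transform returns a numpy array by default, not a dataframe.
scaled = MinMaxScaler().fit_transform(feats)
```

Predicting 1 for every sample always gives a recall of 1, while accuracy and precision both equal the fraction of positive samples. This is why accuracy alone can be misleading on imbalanced data.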
#Load data for this question. Don't change this cell.
X2=pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/website_fall_2022/master/site/public/X2.csv')
y2=X2['stroke']
X2=X2.iloc[:,:4]
#Please make sure you use the exact names for variables including the casing of letters
# YOUR CODE HERE
raise NotImplementedError()
#Variable Checks
assert 'y2_pred' in globals()
assert 'accu' in globals()
assert 'prec' in globals()
assert 'recall' in globals()
assert 'fscore' in globals()
assert 'X2_st' in globals()
#Test cell-10 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert len(y2_pred)==5110
assert set(y2_pred)=={1}
#Test cell-11 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(accu, 2)==0.05
#Test cell-12 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(prec, 2)==0.05
#Test cell-12 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(recall, 2)==1.0
#Test cell-13 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(fscore, 2)==0.09
#Test cell-14 [3pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert round(X2_st.max(), 1)==1.0
#Test cell-15 [2pts] Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
assert type(X2_st)==np.ndarray