34. Homework-5

34.1. Total number of points: 70

35. Due date: October 20th, 2022

Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.

This homework tests your knowledge of the basics of Python. The shared Python notebooks will help you solve these problems.

Steps to evaluate your solutions:

Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/).

Step-2: Launch the Anaconda console, then open Jupyter Notebook from it.

Step-3: Open the .ipynb file and write your solutions at the locations marked “# YOUR CODE HERE”.

Step-4: Restart the kernel and run all cells (in the menubar near the top of this window, select Cell → Run All).

Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.

Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.

36. Please note that for any question in this assignment you will receive points ONLY if your solution passes all the test cases, including the hidden ones. Make sure you consider all possible scenarios before submitting your answers.

  • Note that hidden tests are present to ensure you are not hardcoding.

  • If caught cheating:

    • you will receive a score of 0 for the 1st violation.

    • for repeated incidents, you will receive an automatic ‘F’ grade and will be reported to the dean of Lally School of Management.

37. Q1 [10 points]. Please make sure this is correct, as the following questions depend on it.

  • Load the Boston data from the sklearn library (the import is already provided below).

37.1. Part-1:

  • The given target labels (boston.target) are continuous. Convert them into the discrete values 1 and 2 using the rules below, and save them in an array y_true.

  • every value v in boston.target with v>=5 and v<23 should become 1;

  • every value v in boston.target with v>=23 and v<51 should become 2.

  • Note that the new labels 1 and 2 are integers.

  • Please save these new discrete values in the array y_true.
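The thresholding rule above can be sketched on a small array. This is only an illustration under stated assumptions: `toy_target` is a hypothetical stand-in for boston.target, and `np.select` is one of several ways to apply range conditions.

```python
import numpy as np

# Hypothetical continuous values standing in for boston.target
toy_target = np.array([5.0, 12.3, 22.9, 23.0, 35.5, 50.0])

# One boolean mask per rule; np.select picks the label of the first mask that matches
conditions = [
    (toy_target >= 5) & (toy_target < 23),
    (toy_target >= 23) & (toy_target < 51),
]
toy_labels = np.select(conditions, [1, 2], default=0)  # 0 marks values outside both ranges

print(toy_labels)  # [1 1 1 2 2 2]
```

Using an explicit default makes out-of-range values easy to spot, which is the kind of edge case hidden tests tend to probe.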

37.2. Part-2:

  • Create a list y_pred of the same length as y_true, with every value set to 1.

  • Now use y_true and y_pred to calculate: 1) accuracy, assigned to the variable tacc; 2) precision, assigned to tprec; 3) recall, assigned to trecall.

  • Create another list y_pred_prob of the same length as y_true, with every value set to 0.75. Compute the area under the ROC curve (AUC) and save the answer as tauc.
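To see what these metrics measure for a constant predictor, the sketch below computes them by hand on hypothetical toy labels (the `toy_*` names and the choice of class 1 as the positive class are assumptions for illustration). In practice `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score` and `roc_auc_score` for the same quantities.

```python
# Hypothetical toy labels; class 1 is treated as the positive class
toy_true = [1, 2, 1, 1, 2]
toy_pred = [1] * len(toy_true)     # always predict class 1
toy_prob = [0.75] * len(toy_true)  # the same score for every sample
positive = 1

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == positive and p == positive for t, p in pairs)
fp = sum(t != positive and p == positive for t, p in pairs)
fn = sum(t == positive and p != positive for t, p in pairs)

toy_acc = sum(t == p for t, p in pairs) / len(pairs)
toy_prec = tp / (tp + fp)
toy_recall = tp / (tp + fn)

# Rank-based ROC AUC: chance that a positive sample outscores a negative one,
# with ties counted as one half
pos_scores = [s for t, s in zip(toy_true, toy_prob) if t == positive]
neg_scores = [s for t, s in zip(toy_true, toy_prob) if t != positive]
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
toy_auc = wins / (len(pos_scores) * len(neg_scores))

print(toy_acc, toy_prec, toy_recall, toy_auc)  # 0.6 0.6 1.0 0.5
```

Because every sample gets the same score, every positive/negative pair is a tie, so the AUC is 0.5 regardless of the class balance.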

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

#Load the dataset and save it as a dataframe
#(load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# YOUR CODE HERE
raise NotImplementedError()

#[5 points] Test cell-1
#DO NOT MODIFY/DELETE THIS CELL
assert set(y_true)=={1,2}
assert (list(y_true)).count(1)==312
assert (list(y_true)).count(0)==0
assert (list(y_true)).count(2)==194

#[5 points] Test cell-2
#DO NOT MODIFY/DELETE THIS CELL
assert round(tacc, 2)==0.62
assert round(tprec, 2)==0.62
assert round(trecall, 2)==1.0
assert round(tauc, 2)==0.5

38. Q2 [10 points]. Now split the data into training (80%) and testing (20%) sets using these variable names

  • X_train: Training feature columns

  • X_test: Testing feature columns

  • y_train: Training labels

  • y_test: Testing labels

  • Call train_test_split() with only the parameters df, y_true and ‘test_size’. df and y_true were initialized in the previous question.
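A minimal sketch of such a split on hypothetical toy data (the `toy_*` names are not part of the assignment). It assumes scikit-learn's `train_test_split`, which returns the pieces in the order features-train, features-test, labels-train, labels-test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and one label per sample
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([1, 2] * 5)

# Only the features, the labels and test_size are passed, so the shuffle is not seeded
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2
)

print(toy_X_train.shape, toy_X_test.shape)  # (8, 2) (2, 2)
```

Since no random_state is given, the rows that land in each part differ between runs, but the sizes stay the same.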

import pandas as pd

X_train = pd.DataFrame()
y_train = pd.DataFrame()
X_test = pd.DataFrame()
y_test = pd.DataFrame()

# YOUR CODE HERE
raise NotImplementedError()

#[10 points] Test cell-3
#DO NOT MODIFY/DELETE THIS CELL
assert len(X_train)==404
assert len(y_test)==102
assert len(y_train)==404
assert len(X_test)==102

39. Q3 [5 points]. Use the df from Q1 to perform these operations:

  1. Standardize the data using the StandardScaler() function from the sklearn library.
  2. Compute the principal components using the fit_transform operation on the original dataset.
  3. Find the number of principal components `n_components` that retains not more than 75% of the information present in the original dataset.

  • Make sure you import any Python packages that are not provided by default.

  • Hint: use explained_variance_ratio_.cumsum(), which we discussed in class.
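The cumulative-variance idea from the hint can be sketched on synthetic data. Everything named `toy_*`, the 0.90 threshold and the random data are assumptions for illustration only. The sketch shows two different counts, because "at most a threshold" and "at least a threshold" generally give different answers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 features, two of which are noisy copies of the other two
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 2))
toy_X = np.column_stack([
    latent[:, 0],
    latent[:, 1],
    latent[:, 0] + noise[:, 0],
    latent[:, 1] + noise[:, 1],
])

scaled = StandardScaler().fit_transform(toy_X)  # zero mean, unit variance per column
pca = PCA()                                     # keep all components
components = pca.fit_transform(scaled)

# Running total of the variance fraction explained by the first k components
cum_var = pca.explained_variance_ratio_.cumsum()

toy_threshold = 0.90  # hypothetical threshold, not the one in the question
# Number of leading components whose cumulative variance stays at or below the threshold
n_at_most = int((cum_var <= toy_threshold).sum())
# Smallest number of components whose cumulative variance reaches the threshold
n_at_least = int(np.argmax(cum_var >= toy_threshold)) + 1

print(cum_var, n_at_most, n_at_least)
```

A "not more than" requirement corresponds to the first count (`n_at_most`), while "at least" corresponds to the second.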

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

#Load the dataset and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

n_components=0

# YOUR CODE HERE
raise NotImplementedError()

#[2.5 points] Test cell-4
#DO NOT MODIFY/DELETE THIS CELL
assert n_components!=0

#[2.5 points] Test cell-5 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL

40. Q4 [10 points]. Given 3 different series S1, S2, S3

  1. First, use these 3 series to build a dataframe 'df' in which each column is named after its series: `S1`, `S2` or `S3`.
  2. Please note that 'df' doesn't contain any class label column. We are building a simple dataframe with 3 columns from 3 different series, each treated as a feature.
  3. If you perform PCA on this dataframe, what is the maximum number of principal components you can obtain?
  4. There is no need to run PCA; manually assign the value to an integer variable `max_pcs`.
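As a sketch of the two ideas in this question, the example below builds a dataframe from two hypothetical series (`A` and `B`, not the ones in the assignment) and computes the general upper bound on the number of principal components. The bound used here is the one scikit-learn's PCA applies when `n_components` is not set: the smaller of the number of rows and the number of columns.

```python
import pandas as pd

# Hypothetical series, deliberately different from S1, S2 and S3
A = pd.Series([1.0, 2.0, 3.0])
B = pd.Series([0.5, 0.25, 0.125])

# The dictionary keys become the column names
toy_df = pd.DataFrame({'A': A, 'B': B})

n_samples, n_features = toy_df.shape
# PCA cannot return more components than min(rows, columns)
upper_bound = min(n_samples, n_features)

print(list(toy_df.columns), toy_df.shape, upper_bound)  # ['A', 'B'] (3, 2) 2
```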
import pandas as pd
S1 = pd.Series([1,2,3,4]) #corresponding column name should be 'S1'
S2 = pd.Series([10, 20, 30, 40]) #corresponding column name should be 'S2'
S3 = pd.Series([2,4,6,8]) #corresponding column name should be 'S3'

max_pcs=0 #Update this variable after building the dataframe #No need to perform PCA 

# YOUR CODE HERE
raise NotImplementedError()

#[7.5 points] Test cell-6
#DO NOT MODIFY/DELETE THIS CELL
assert set(df.columns)=={'S1', 'S2', 'S3'}
assert df['S3'].mean()==5.0
assert df['S1'].mean()==2.5

#[2.5 points] Test cell-7 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL

41. Q5 [10 points]. Linear regression: Use the boston dataset loaded into the dataframe df

41.1. 1. Without using the train_test_split() function

  • Use the first 400 rows (i.e., index 0, 1, …, 399), with the corresponding features loaded as train_X and the corresponding labels as train_y.

  • Use the remaining rows as testing data: test_X and test_y for the features and labels respectively.

  • Fit a linear regression model on the training data; then use it to predict labels for the testing data, as shown in the lecture notebook. Please use the default parameters when calling the linear regression function.

41.2. 2. Measure the mean-squared error (MSE) ‘mse_split1’ between the predicted labels and the testing labels.

  • Round mse_split1 to 2 decimal places.

  • Hint: check the imported libraries to work out which function computes the MSE.
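The workflow described above (positional split, fit, predict, score) can be sketched on synthetic data. The `toy_*` names, the 40/10 split and the generated data are assumptions for illustration; only the library calls mirror the imports given in the cell below.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Hypothetical data: 50 rows, 2 features, target is a noisy linear combination
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
toy_target = 3.0 * toy_df['x1'] - 2.0 * toy_df['x2'] + rng.normal(scale=0.1, size=50)

# Positional split without train_test_split: first 40 rows train, the rest test
toy_train_X, toy_test_X = toy_df.iloc[:40], toy_df.iloc[40:]
toy_train_y, toy_test_y = toy_target.iloc[:40], toy_target.iloc[40:]

model = linear_model.LinearRegression()  # default parameters
model.fit(toy_train_X, toy_train_y)
toy_pred = model.predict(toy_test_X)

# MSE between true and predicted test labels, rounded to 2 decimal places
toy_mse = round(mean_squared_error(toy_test_y, toy_pred), 2)
print(len(toy_train_X), len(toy_test_y), toy_mse)
```

Slicing with `iloc` selects rows by position, so `iloc[:40]` covers positions 0 through 39 and `iloc[40:]` covers everything after them.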

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target

train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()

mse_split1=0

# YOUR CODE HERE
raise NotImplementedError()

#[5 points] Test cell-8
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==400
assert len(test_y)==106
assert mse_split1==37.89

#[5 points] Test cell-9 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL

42. Q6 [10 points]. Repeat the above exercise (Q5) with a different training and testing split of the boston dataset.

42.1. 1. Without using the train_test_split() function

  • Use the last rows, from index 253 to the end (i.e., index 253, 254, …, 505), with the corresponding features loaded as train_X and the corresponding labels as train_y.

  • Use the remaining rows (index 0, 1, 2, …, 252) as testing data: test_X and test_y for the features and labels respectively.

  • Fit a linear regression model on the training data; then use it to predict labels for the testing data, as shown in the lecture notebook. Please use the default parameters when calling the linear regression function.

42.2. 2. Measure the mean-squared error (MSE) ‘mse_split2’ between the predicted labels and the testing labels.

  • Round mse_split2 to 2 decimal places.

  • Hint: check the imported libraries to work out which function computes the MSE.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target

train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()

mse_split2=0

# YOUR CODE HERE
raise NotImplementedError()

#[5 points] Test cell-10
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==253
assert len(test_y)==253
assert mse_split2==27.22

#[5 points] Test cell-11 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL

43. Q7 [15 points]. We have loaded the boston dataset. Use the dataframe df to do the following:

43.1. 1. Convert the target labels boston.target from continuous to the discrete values 1, 2 and 3 using this approach.

  • every value v in boston.target with v>=5 and v<20 should become 1;

  • every value v in boston.target with v>=20 and v<35 should become 2;

  • every value v in boston.target with v>=35 and v<51 should become 3.
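With three or more consecutive ranges, binning by edges is a compact alternative to writing one condition per range. The sketch below assumes `np.digitize` with its default half-open intervals (lower edge included, upper edge excluded); `toy_values` is hypothetical data chosen to sit on and around the edges.

```python
import numpy as np

# Hypothetical values placed at and just below each edge
toy_values = np.array([5.0, 19.9, 20.0, 34.9, 35.0, 50.0])

# Edges of the three ranges [5, 20), [20, 35), [35, 51)
edges = [5, 20, 35, 51]

# Returns i such that edges[i-1] <= value < edges[i]
toy_bins = np.digitize(toy_values, edges)

print(toy_bins)  # [1 1 2 2 3 3]
```

Values below the first edge would come back as 0 and values at or above the last edge as 4, so unexpected inputs remain visible.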

43.2. 2. Without using train_test_split(), use the first 400 rows (i.e., index 0, 1, …, 399) where

  • the corresponding features are loaded as train_X and the corresponding labels as train_y.

  • Use the rest of the rows as testing data: test_X and test_y for the features and ground-truth labels respectively.

  • Fit a logistic regression with solver='newton-cg', C=1e5 and multi_class='multinomial'; then use it to predict labels for the testing data, as shown in the lecture notebook. If you encounter this warning, please ignore it: “ConvergenceWarning: newton-cg failed to converge.”

43.3. 3. Measure the accuracy acc_split3 of the predicted labels against the ground-truth test labels test_y.

  • Round acc_split3 to 2 decimal places.

  • Hint: check the imported libraries to work out which function computes accuracy.
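A sketch of the fit, predict and score steps on synthetic three-class data. The `toy_*` names, cluster centers and 70/20 split are assumptions for illustration. The `multi_class` argument is left out on purpose: it is deprecated in newer scikit-learn releases, and the assumption here is that recent versions already use the multinomial formulation for multiclass data with the newton-cg solver.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Hypothetical 3-class data: one cluster of 30 points per class
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 0.0]])
toy_X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
toy_y = np.repeat([1, 2, 3], 30)

# Shuffle once so a positional split mixes the classes
order = rng.permutation(len(toy_y))
toy_X, toy_y = toy_X[order], toy_y[order]

# Positional split: first 70 rows train, remaining 20 test
toy_train_X, toy_test_X = toy_X[:70], toy_X[70:]
toy_train_y, toy_test_y = toy_y[:70], toy_y[70:]

# A large C means weak regularization; a ConvergenceWarning may appear and is harmless here
clf = linear_model.LogisticRegression(solver='newton-cg', C=1e5)
clf.fit(toy_train_X, toy_train_y)
toy_pred = clf.predict(toy_test_X)

# Fraction of test labels predicted correctly, rounded to 2 decimal places
toy_acc = round(accuracy_score(toy_test_y, toy_pred), 2)
print(toy_acc)
```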

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import accuracy_score

#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target

train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()

acc_split3=0

# YOUR CODE HERE
raise NotImplementedError()

#[5 points] Test cell-12
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==400
assert set(train_y)=={1,2,3}

#[5 points] Test cell-13
#DO NOT MODIFY/DELETE THIS CELL
assert acc_split3==0.74
assert len(test_y)==106

#[5 points] Test cell-14 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL