34. Homework-5
34.1. Total number of points: 70
35. Due date: October 20th, 2022
Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.
This homework will test your knowledge of the basics of Python. The shared Python notebooks will be helpful for solving these problems.
Steps to evaluate your solutions:
Step-1: Ensure you have installed Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)
Step-2: Open the Jupyter Notebook by first launching the Anaconda software console
Step-3: Open the .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”
Step-4: Restart the kernel and run all cells (in the menubar, select Cell → Run All), at the top center-right of this window.
Step-5: Now go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the .ipynb file format.
Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.
37. Q1 [10 points]. Please make sure this is correct, as the following questions depend on it.
Load the Boston data from the sklearn library (the import is already provided below).
37.1. Part-1:
The given target labels (boston.target) are continuous. Convert them into the discrete values 1 and 2 using the following rule, and save the new discrete values in an array `y_true`:
- every value v1 in boston.target should become 1 if v1>=5 and v1<23;
- every value v2 in boston.target should become 2 if v2>=23 and v2<51.
Note that the new/transformed labels 1 and 2 are integers.
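The interval rule above can be illustrated on a small array. This is only a sketch under stated assumptions: the `values` array is hypothetical toy data, and marking uncovered values with 0 is a choice made for this example, not something the assignment specifies.

```python
import numpy as np

# Hypothetical continuous values standing in for a real target column.
values = np.array([5.0, 12.5, 22.9, 23.0, 36.2, 50.0])

# Half-open intervals: [5, 23) maps to 1 and [23, 51) maps to 2.
conditions = [(values >= 5) & (values < 23), (values >= 23) & (values < 51)]

# default=0 (an assumption of this sketch) flags any value outside both intervals.
labels = np.select(conditions, [1, 2], default=0).astype(int)

print(labels.tolist())
```

Because both bounds are written out explicitly, the boundary value 23.0 falls in the second interval rather than the first.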
37.2. Part-2:
Create a list `y_pred` which is of the same length as `y_true` and insert all values as 1. Now use `y_true` and `y_pred` to calculate: 1) accuracy and assign it to the variable `tacc`; 2) precision and assign it to the variable `tprec`; 3) recall and assign it to the variable `trecall`.
Create another list `y_pred_prob` which is of the same length as `y_true` and insert all values as 0.75. Compute the Area Under the Curve (AUC) and save the answer as `tauc`.
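To see how these metrics behave for a constant prediction, here is a minimal sketch on five hypothetical labels. The names `toy_true`, `toy_pred` and `toy_prob` are made up for illustration, and treating class 1 as the positive class is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels: three samples of class 1 and two of class 2.
toy_true = np.array([1, 1, 1, 2, 2])
toy_pred = np.ones_like(toy_true)        # constant prediction: always class 1
toy_prob = np.full(len(toy_true), 0.75)  # the same score for every sample

acc = accuracy_score(toy_true, toy_pred)                 # 3 of 5 correct
prec = precision_score(toy_true, toy_pred, pos_label=1)  # TP / (TP + FP) = 3 / 5
rec = recall_score(toy_true, toy_pred, pos_label=1)      # TP / (TP + FN) = 3 / 3

# Assumption: class 1 is the positive class, so binarize the labels before scoring.
auc = roc_auc_score((toy_true == 1).astype(int), toy_prob)

print(acc, prec, rec, auc)
```

A constant score cannot rank any positive sample above a negative one, which is why the AUC lands at the chance level of 0.5 regardless of the class balance.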
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
#Load the dataset and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# YOUR CODE HERE
raise NotImplementedError()
#[5 points] Test cell-1
#DO NOT MODIFY/DELETE THIS CELL
assert set(y_true)=={1,2}
assert (list(y_true)).count(1)==312
assert (list(y_true)).count(0)==0
assert (list(y_true)).count(2)==194
#[5 points] Test cell-2
#DO NOT MODIFY/DELETE THIS CELL
assert round(tacc, 2)==0.62
assert round(tprec, 2)==0.62
assert round(trecall, 2)==1.0
assert round(tauc, 2)==0.5
38. Q2 [10 points]. Now split the data into training (80%) and testing data (20%) by using these variable names
- `X_train`: Training feature columns
- `X_test`: Testing feature columns
- `y_train`: Training labels
- `y_test`: Testing labels
Perform the split with only the parameters df, y_true and ‘test_size’. `df` and `y_true` are initialized in the previous question.
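As an illustration of an 80/20 split, the sketch below applies scikit-learn's `train_test_split` to a hypothetical 10-row table. The names `features`, `labels`, `Xa`, `Xb`, `ya` and `yb` are placeholders chosen for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table and labels with 10 rows.
features = pd.DataFrame({"f1": np.arange(10), "f2": np.arange(10) * 2.0})
labels = np.array([1, 2] * 5)

# test_size=0.2 reserves 20% of the rows for testing; rows are shuffled by default.
Xa, Xb, ya, yb = train_test_split(features, labels, test_size=0.2)

print(len(Xa), len(Xb), len(ya), len(yb))
```

The function returns the pieces in the order train features, test features, train labels, test labels, so the unpacking order matters.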
import pandas as pd
X_train = pd.DataFrame()
y_train = pd.DataFrame()
X_test = pd.DataFrame()
y_test = pd.DataFrame()
# YOUR CODE HERE
raise NotImplementedError()
#[10 points] Test cell-3
#DO NOT MODIFY/DELETE THIS CELL
assert len(X_train)==404
assert len(y_test)==102
assert len(y_train)==404
assert len(X_test)==102
39. Q3 [5 points]. Use the `df` in Q1 to perform these operations:
- Standardize the data using the StandardScaler() function from sklearn library
- Compute principal components using the fit_transform operation on the original dataset
- Now find the number of principal components `n_components` that will retain not more than 75% of the information present in the original dataset
Make sure you import any Python packages that are not already provided.
Hint: use explained_variance_ratio_.cumsum(), which we discussed in class.
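The standardize, project and accumulate steps can be sketched on synthetic data as follows. The data, the variable names and the 0.75 `threshold` are assumptions for illustration, and counting components whose running total stays at or below the threshold is only one possible reading of "not more than".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Hypothetical data: 5 columns, 3 of which are noisy mixtures of the first 2.
mixed = base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))
data = np.hstack([base, mixed])

# Standardize each column to zero mean and unit variance, then project.
scaled = StandardScaler().fit_transform(data)
pca = PCA()
scores = pca.fit_transform(scaled)

# Running total of the fraction of variance explained by the leading components.
cumulative = pca.explained_variance_ratio_.cumsum()

threshold = 0.75
# Count the components whose running total stays at or below the threshold.
n_keep = int((cumulative <= threshold).sum())

print(cumulative, n_keep)
```

Since `explained_variance_ratio_` is sorted in decreasing order, the cumulative sum rises monotonically and reaches 1.0 once every component is included.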
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
#Load the dataset and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
n_components=0
# YOUR CODE HERE
raise NotImplementedError()
#[2.5 points] Test cell-4
#DO NOT MODIFY/DELETE THIS CELL
assert n_components!=0
#[2.5 points] Test cell-5 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
40. Q4 [10 points]. Given 3 different series `S1`, `S2`, `S3`
- First, using these 3 series, build a dataframe 'df' where the column name associated with each series is the same as the name of the series, i.e. `S1`, `S2` or `S3`.
- Please note that 'df' doesn't contain any class label column. We are building a simple dataframe with 3 columns from 3 different series, each considered as a feature.
- If you perform a PCA operation, what is the maximum number of principal components you can obtain?
- There is no need to run PCA; manually assign the value to an integer variable `max_pcs`.
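One way to assemble a dataframe from named series, shown on hypothetical series `A` and `B` (deliberately not the ones from this question), is to pass a dictionary. The sketch also reads back how many components scikit-learn keeps when `n_components` is left unset.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical series, deliberately different from the ones in the question.
A = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
B = pd.Series([0.5, 0.1, 0.9, 0.3, 0.7])

# Dictionary keys become the column names.
toy = pd.DataFrame({"A": A, "B": B})

# With n_components left unset, scikit-learn keeps min(n_samples, n_features) components.
pca = PCA().fit(toy)

print(toy.shape, pca.n_components_)
```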
import pandas as pd
S1 = pd.Series([1,2,3,4]) #corresponding column name should be 'S1'
S2 = pd.Series([10, 20, 30, 40]) #corresponding column name should be 'S2'
S3 = pd.Series([2,4,6,8]) #corresponding column name should be 'S3'
max_pcs=0 #Update this variable after building the dataframe #No need to perform PCA
# YOUR CODE HERE
raise NotImplementedError()
#[7.5 points] Test cell-6
#DO NOT MODIFY/DELETE THIS CELL
assert set(df.columns)=={'S1', 'S2', 'S3'}
assert df['S3'].mean()==5.0
assert df['S1'].mean()==2.5
#[2.5 points] Test cell-7 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
41. Q5 [10 points]. Linear regression: Use the boston dataset loaded into the dataframe `df`
41.1. 1. Without using train_test_split() function
Use the first 400 rows (i.e., index 0, 1, …, 399), where the corresponding features are loaded as `train_X` and the corresponding labels as `train_y`. Use the remaining rows as testing data, with `test_X` and `test_y` for the features and labels respectively.
Fit a linear regression line using the training data; then use it to predict labels for the testing data as shown in the lecture notebook. Please use the default parameters when calling the linear regression function.
41.2. 2. Measure the mean-squared error (MSE) ‘mse_split1’ using the predicted labels with testing labels.
Round `mse_split1` to 2 values after the decimal point.
Hint: check the libraries given to work out which function to use to compute the MSE.
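The positional split, fit and MSE steps can be sketched on synthetic data. Everything here is hypothetical (the 50-row table, the 40/10 split and all variable names), and the target is exactly linear, so the error is expected to round to zero.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical data: 50 rows with an exactly linear target.
toy_X = pd.DataFrame(rng.normal(size=(50, 2)), columns=["a", "b"])
toy_y = 2.0 * toy_X["a"] - toy_X["b"] + 1.0

# Positional split with iloc: first 40 rows for training, the rest for testing.
fit_X, fit_y = toy_X.iloc[:40], toy_y.iloc[:40]
held_X, held_y = toy_X.iloc[40:], toy_y.iloc[40:]

model = linear_model.LinearRegression()  # default parameters
model.fit(fit_X, fit_y)
predicted = model.predict(held_X)

# mean_squared_error takes the true values first, then the predictions.
toy_mse = round(mean_squared_error(held_y, predicted), 2)

print(len(fit_X), len(held_y), toy_mse)
```

Slicing with `iloc` keeps the rows in their original order, unlike `train_test_split`, which shuffles by default.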
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target
train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()
mse_split1=0
# YOUR CODE HERE
raise NotImplementedError()
#[5 points] Test cell-8
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==400
assert len(test_y)==106
assert mse_split1==37.89
#[5 points] Test cell-9 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
42. Q6 [10 points]. Repeat the above exercise (Q5) but with different training and testing splits using the `boston` dataset.
42.1. 1. Without using train_test_split() function
Use the last set of rows, from index 253 to the end (i.e., index 253, 254, …, 505), where the corresponding features are loaded as `train_X` and the corresponding labels as `train_y`. Use the remaining rows (index 0, 1, 2, …, 252) as testing data, with `test_X` and `test_y` for the features and labels respectively.
Fit a linear regression line using the training data; then use it to predict labels for the testing data as shown in the lecture notebook. Please use the default parameters when calling the linear regression function.
42.2. 2. Measure the mean-squared error (MSE) ‘mse_split2’ using the predicted labels with testing labels.
Round `mse_split2` to 2 values after the decimal point.
Hint: check the libraries given to work out which function to use to compute the MSE.
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target
train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()
mse_split2=0
# YOUR CODE HERE
raise NotImplementedError()
#[5 points] Test cell-10
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==253
assert len(test_y)==253
assert mse_split2==27.22
#[5 points] Test cell-11 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL
43. Q7 [15 points]. We have loaded the boston dataset. Use the dataframe `df` to do the following:
43.1. 1. Convert the target labels (boston.target) from continuous to discrete values 1, 2 and 3 using this approach.
- every value v1 in boston.target should become 1 if v1>=5 and v1<20;
- every value v2 in boston.target should become 2 if v2>=20 and v2<35;
- every value v3 in boston.target should become 3 if v3>=35 and v3<51.
43.2. 2. Without using train_test_split(), use the first 400 rows (i.e., index 0, 1, …, 399), where
the corresponding features are loaded as `train_X` and the corresponding labels as `train_y`. Use the rest of the rows as testing data, with `test_X` and `test_y` for the features and ground-truth labels respectively.
Fit a logistic regression with solver=`newton-cg`, C=`1e5`, multi_class=`multinomial`; then use it to predict labels for the testing data as shown in the lecture notebook. If you encounter this warning, please ignore it: “ConvergenceWarning: newton-cg failed to converge.”
43.3. 3. Measure the accuracy `acc_split3` using the predicted labels with ground-truth test labels `test_y`.
Round `acc_split3` to 2 values after the decimal point.
Hint: check the libraries given to work out which function to use to compute accuracy.
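The binning, positional split and classification steps can be sketched on synthetic data. The data and names below are hypothetical, and the sketch leaves out the `multi_class` argument on the assumption that the installed scikit-learn may be a recent release, where that argument is deprecated and the multinomial formulation is already used for multiclass problems with this solver.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 3 features, continuous target kept inside [5, 50].
toy_X = rng.normal(size=(200, 3))
target = np.clip(25 + 8 * toy_X[:, 0] + rng.normal(size=200), 5, 50)

# np.digitize with edges [20, 35] returns 0 below 20, 1 in [20, 35), 2 at 35 and above;
# adding 1 turns those bin indices into the class labels 1, 2 and 3.
classes = np.digitize(target, [20, 35]) + 1

# Positional split: first 150 rows for training, the rest for testing.
fit_X, fit_y = toy_X[:150], classes[:150]
held_X, held_y = toy_X[150:], classes[150:]

# A large C means weak regularization; multi_class is intentionally not passed (see above).
clf = linear_model.LogisticRegression(solver="newton-cg", C=1e5)
clf.fit(fit_X, fit_y)

toy_acc = round(accuracy_score(held_y, clf.predict(held_X)), 2)
print(toy_acc)
```

With a scikit-learn version contemporary with this assignment, `multi_class="multinomial"` can still be passed explicitly as the question asks.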
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import accuracy_score
#Load the dataset without class labels and save it as a dataframe
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
#class label -- boston.target
train_X = pd.DataFrame()
train_y = pd.DataFrame()
test_X = pd.DataFrame()
test_y = pd.DataFrame()
acc_split3=0
# YOUR CODE HERE
raise NotImplementedError()
#[5 points] Test cell-12
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)==400
assert set(train_y)=={1,2,3}
#[5 points] Test cell-13
#DO NOT MODIFY/DELETE THIS CELL
assert acc_split3==0.74
assert len(test_y)==106
#[5 points] Test cell-14 Hidden tests
#DO NOT MODIFY/DELETE THIS CELL