50. Homework-7
50.1. Total number of points: 40
51. Due date: Nov 21, 2022
Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.
This homework tests your knowledge of random forests (including feature importances) and neural networks. The shared Python notebooks will be helpful for solving these problems.
Steps to evaluate your solutions:
Step-1: Try on Colab or Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)
Step-2: Launch the Anaconda console, then open Jupyter Notebook from it.
Step-3: Open the homework’s .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”
Step-4: Restart the kernel and click Run All (in the menubar at the top of this window, select Cell → Run All).
Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the “.ipynb” extension.
Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.
53. 1. Create a new dataframe df_sub2 that contains only these feature columns in df:
Page total likes, Type, Category, Post Month, Post Weekday, Post Hour, Paid, Total Interactions
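As a rough illustration of selecting a subset of columns, here is a minimal sketch on a made-up frame. The names df_toy, df_toy_sub, keep_cols and the "comment" column, as well as all values, are assumptions for illustration and are not part of the homework data.

```python
import pandas as pd

# Toy stand-in for df; values and the extra "comment" column are invented.
df_toy = pd.DataFrame({
    "Page total likes": [139441, 139441, 138414],
    "Type": ["Photo", "Status", "Link"],
    "Total Interactions": [100, 164, 80],
    "comment": [4, 5, 0],
})

keep_cols = ["Page total likes", "Type", "Total Interactions"]

# Indexing with a list of names returns only those columns, in that order.
# .copy() makes the result independent of df_toy, so that later column
# assignments do not raise pandas' SettingWithCopyWarning.
df_toy_sub = df_toy[keep_cols].copy()

print(df_toy_sub.columns.tolist())
```

The same pattern should carry over to the real dataframe once the full list of column names is supplied.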
54. 2. Transform the categorical attribute Type in df_sub2 to a numerical attribute this way:
Link: 1, Photo: 2, Status: 3, Video: 4
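One possible way to apply such a mapping is Series.map with a dictionary. The sketch below runs on a made-up frame; the names toy and type_codes are assumptions for illustration.

```python
import pandas as pd

# Toy column with the four category labels (row order is invented).
toy = pd.DataFrame({"Type": ["Photo", "Status", "Link", "Video", "Photo"]})

# Mapping taken from the task statement.
type_codes = {"Link": 1, "Photo": 2, "Status": 3, "Video": 4}

# Series.map looks each value up in the dict; any label missing from
# the dict would become NaN, so it is worth checking for that afterwards.
toy["Type"] = toy["Type"].map(type_codes)

print(toy["Type"].tolist())
```

Checking toy["Type"].isna().any() after the mapping is a cheap way to catch misspelled or unexpected labels.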
55. 3. Perform standardization (using this formula: https://en.wikipedia.org/wiki/Standard_score) only on the Page total likes column in df_sub2
Please use <dataframe>['<column>'].mean() and <dataframe>['<column>'].std() if you are using the mean and std values to manipulate the column.
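The standard score is z = (x - mean) / std. A minimal sketch on a made-up column follows; the frame toy and the column name "likes" are assumptions for illustration.

```python
import pandas as pd

# Toy numeric column; the values are invented.
toy = pd.DataFrame({"likes": [100.0, 120.0, 140.0, 160.0]})

col_mean = toy["likes"].mean()
col_std = toy["likes"].std()  # pandas uses the sample std (ddof=1) by default

# Standard score: subtract the mean, then divide by the standard deviation.
toy["likes"] = (toy["likes"] - col_mean) / col_std

# After standardization the column has mean ~0 and std ~1.
print(toy["likes"].mean(), toy["likes"].std())
```

Note that pandas' .std() and NumPy's np.std() use different default denominators (n-1 versus n), so the two give slightly different scaled values.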
56. 4. Using df_sub2, perform a train_test_split operation to build the training data (X_train, y_train) and testing data (X_test, y_test).
Use test_size=0.3, random_state=42 as the parameters for train_test_split() function.
Feature columns (X): Page total likes, Type, Category, Post Month, Post Weekday, Post Hour, Paid
Class label column (y): Total Interactions
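To show how train_test_split behaves with these parameters, here is a small sketch on a made-up 10-row frame. The column names feat_a, feat_b, target and the shortened variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two feature columns and one target column (values invented).
toy = pd.DataFrame({
    "feat_a": range(10),
    "feat_b": range(10, 20),
    "target": range(100, 110),
})

X = toy[["feat_a", "feat_b"]]
y = toy["target"]

# test_size=0.3 holds out 30% of the rows (rounded up);
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))
```

Because X and y are split together, each row of X_tr keeps the same index as its label in y_tr.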
57. 5. Train the random forest regressor (initialized as variable rf) using these parameters: max_depth=3, random_state=0.
Using the trained model rf, compute the impurity-based feature importances. Append the names of the top-3 features to the list impFeatures. Please make sure you type the feature names exactly as in df_sub2.
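For orientation, the sketch below shows how impurity-based importances can be read from a fitted scikit-learn random forest and ranked. It uses synthetic data; the names X_toy, y_toy, model, importances, top_2 and the feature names are assumptions for illustration, not the homework's variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on "signal", weakly on "weak".
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "weak": rng.normal(size=200),
    "noise_1": rng.normal(size=200),
    "noise_2": rng.normal(size=200),
})
y_toy = 5.0 * X_toy["signal"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(max_depth=3, random_state=0)
model.fit(X_toy, y_toy)

# feature_importances_ holds one impurity-based score per input column,
# in the same order as the training columns; pairing it with the column
# names makes the ranking readable.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
top_2 = importances.sort_values(ascending=False).head(2).index.tolist()

print(importances.sort_values(ascending=False))
```

Keeping the importances in a Series indexed by column name avoids typing feature names by hand.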
# YOUR CODE HERE
raise NotImplementedError()
##Cell-1 -- Do not modify this cell
assert set(df_sub2.columns)=={'Post Hour', 'Total Interactions', 'Post Weekday', 'Category', 'Paid', 'Post Month', 'Page total likes', 'Type'}
assert len(df_sub2)==495
assert len(df_sub2.iloc[0,:])==8
##Cell-2 -- Do not modify this cell
assert round(df_sub2['Page total likes'].mean(),0)==0
assert len(X_train)==346
assert 'Post Hour' in impFeatures
##Cell-3 -- Do not modify this cell
#assert rf.n_classes_==230 #THIS WAS AN INCORRECT TEST.
assert len(impFeatures)==3
assert math.ceil(df_sub2['Page total likes'].std())==1
##Cell-4 -- Do not modify this cell
assert (rf.n_features_in_)==7  # n_features_ was deprecated in scikit-learn 1.0 and removed in 1.2
assert len(y_test)==149
assert set(impFeatures)=={'Page total likes', 'Post Hour', 'Post Weekday'}
58. Part-2 [16 points]: We will now use neural networks to model a regression problem.
59. 1. Create a new dataframe df_sub4 that contains only these feature columns in df:
Page total likes, Type, Category, Post Month, Post Weekday, Post Hour, Paid, Total Interactions
60. 2. Perform one-hot encoding on these features below in the dataframe df_sub4 to create a new dataframe df_OHE
Type, Category, Post Month, Post Weekday, Post Hour
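One common way to one-hot encode selected columns is pandas.get_dummies with the columns argument. The sketch below uses a made-up three-row frame; the names toy and toy_ohe and all values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: one string column, one integer-coded column, one untouched column.
toy = pd.DataFrame({
    "Type": ["Photo", "Link", "Photo"],
    "Post Month": [12, 1, 12],
    "Paid": [0, 1, 0],
})

# Columns listed in `columns` are replaced by one indicator column per
# distinct value, named "<column>_<value>". Listing a numeric column
# explicitly (here "Post Month") forces it to be encoded as well;
# columns not listed (here "Paid") are passed through unchanged.
toy_ohe = pd.get_dummies(toy, columns=["Type", "Post Month"])

print(toy_ohe.columns.tolist())
```

Depending on the pandas version, the indicator columns are boolean or small integers; either form works as model input after conversion to a numeric array.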
61. 3. Perform normalization only on the Page total likes and Total Interactions columns in df_OHE using MinMaxScaler or minmax_scale
Note that both of these will output the same result.
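The equivalence of the two scikit-learn options can be checked directly. The sketch below uses a made-up frame; the names toy, cols, scaled_a, scaled_b and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, minmax_scale

# Toy frame with two numeric columns (values invented).
toy = pd.DataFrame({
    "likes": [100.0, 150.0, 200.0],
    "interactions": [10.0, 40.0, 20.0],
})
cols = ["likes", "interactions"]

# Class-based API: fit learns each column's min and max, transform rescales.
scaled_a = MinMaxScaler().fit_transform(toy[cols])

# Function API: the same rescaling in a single call.
scaled_b = minmax_scale(toy[cols])

# Both map each column to [0, 1] via (x - min) / (max - min).
print(np.allclose(scaled_a, scaled_b))

# Write the scaled values back into only the selected columns.
toy[cols] = scaled_a
```

The class-based form is handy when the same fitted scaling must be reused on other data; the function form is shorter for a one-off transformation.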
62. 4. Using df_OHE, perform a train_test_split operation to build the training data (X_train, y_train) and testing data (X_test, y_test).
Use test_size=0.3, random_state=42 as the parameters for train_test_split() function.
Feature columns (X): Everything except Total Interactions
Class label column (y): Total Interactions
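When the features are "everything except" one column, DataFrame.drop avoids listing the many one-hot columns by hand. A minimal sketch on a made-up frame follows; the names toy, target, X_all, y_all and the values are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for a one-hot encoded table (values invented).
toy = pd.DataFrame({
    "Paid": [0, 1, 0, 1],
    "Type_Link": [1, 0, 0, 1],
    "Type_Photo": [0, 1, 1, 0],
    "Total Interactions": [0.1, 0.9, 0.4, 0.0],
})

target = "Total Interactions"

# drop returns a new frame without the target column; toy itself is unchanged.
X_all = toy.drop(columns=[target])
y_all = toy[target]

print(X_all.shape, y_all.shape)
```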