16. Homework-3
16.1. Total number of points: 70
17. Due date: September 29, 2022
Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.
This homework will test your knowledge of data manipulation, string manipulation, and feature preprocessing. The shared Python notebooks will help you solve these problems.
Steps to evaluate your solutions:
Step-1: Try on Colab or Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)
Step-2: Launch the Anaconda software console, then open Jupyter Notebook.
Step-3: Open the homework’s .ipynb file and write your solutions at each location marked “# YOUR CODE HERE”.
Step-4: Restart the kernel, then run all cells (in the menubar at the top of this window, select Cell → Run All).
Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the “.ipynb” extension.
Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit it.
19. Q1. Shape of a Data Frame.
Import the pandas package as pd.
Read the file from the ‘url’ and load the data into the dataframe ‘df’ with the default index.
Assign the number of rows to the variable rows and the number of columns to the variable cols.
Print out the number of rows and columns, clearly labeling each.
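As a minimal sketch of how a dataframe reports its dimensions, the following uses a small made-up dataframe. The name toy and its columns are illustrative assumptions, not the homework data.

```python
import pandas as pd

# Hypothetical toy data, not the homework file.
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, 2.5, 4.0]})

# DataFrame.shape is a (number_of_rows, number_of_columns) tuple,
# so it can be unpacked into two variables.
n_rows, n_cols = toy.shape

print("Rows:", n_rows)
print("Columns:", n_cols)
```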
url='https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv'
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert rows==891
assert df['Age'].sum()==21205.17
#DO NOT MODIFY/DELETE THIS CELL
We will now use the dataframe df from above for several preprocessing operations. The libraries required for the remaining questions are loaded here.
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
20. Q2 Dataframe Basic Analyses
Determine how many first, second, and third class passengers there are, assigning the counts to the variables class1, class2, and class3.
Hint – Use the value_counts operation.
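To illustrate the hint, here is a short sketch of value_counts on a made-up column. The dataframe toy and the column grade are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data with a categorical column.
toy = pd.DataFrame({"grade": ["A", "B", "A", "C", "A", "B"]})

# value_counts returns a Series indexed by the distinct values,
# holding how many times each one occurs.
counts = toy["grade"].value_counts()

# Individual counts can be looked up by label.
n_a = counts["A"]
n_b = counts["B"]
print(counts)
```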
#Your answer here.
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert class1>0
assert class2>0
assert class3>0
#DO NOT MODIFY/DELETE THIS CELL
21. Q3 Groupby
Now use a groupby statement to calculate the mean age (use the ‘Age’ attribute) of passengers of each gender.
Round each mean to 2 decimal places (for example, 3.14156 becomes 3.14) and assign the results to the variables female_age and male_age.
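A groupby followed by a mean can be sketched on made-up data as follows. The names toy, team, and height are hypothetical and unrelated to the homework columns.

```python
import pandas as pd

# Hypothetical toy data: two groups with two members each.
toy = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "height": [1.70, 1.815, 1.80, 1.60],
})

# Group by the label column, take the mean of the numeric column,
# and round every group mean to 2 decimal places.
mean_height = toy.groupby("team")["height"].mean().round(2)

# Each group's result is looked up by its label and is a NumPy float.
x_height = mean_height["x"]
y_height = mean_height["y"]
print(mean_height)
```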
#Your answer
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert female_age>0
assert isinstance(female_age, np.floating)
assert male_age>0
assert isinstance(male_age, np.floating)
#DO NOT MODIFY/DELETE THIS CELL
22. Q4 Split Dataframe
Now split the dataframe df into 3 different dataframes dfclass1, dfclass2, and dfclass3 using the Pclass variable.
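One common way to split a dataframe by the value of a column is boolean masking. The sketch below assumes a made-up dataframe toy with a hypothetical column kind; it is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"kind": [1, 2, 1, 3], "value": [10, 20, 30, 40]})

# A comparison produces a boolean Series; indexing with it keeps
# only the rows where the condition is True.
kind1 = toy[toy["kind"] == 1]
kind2 = toy[toy["kind"] == 2]

print(kind1)
print(kind2)
```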
#Your answer
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(dfclass1)>0
assert len(dfclass2)>0
assert len(dfclass3)>0
assert isinstance(dfclass1, pd.DataFrame)
assert isinstance(dfclass2, pd.DataFrame)
assert isinstance(dfclass3, pd.DataFrame)
#DO NOT MODIFY/DELETE THIS CELL
22.1. Q5 Filter Missing Values
Create a new dataframe dfna1 from df that drops every row in which any variable is missing.
Create a new dataframe dfna2 from df that drops every row in which both Age and Cabin are missing.
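The difference between dropping rows with any missing value and dropping rows where a chosen set of columns are all missing can be sketched with dropna. The dataframe toy and its columns a, b, and c are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, rows 1 and 2 each miss
# one value, and row 3 misses both "a" and "b".
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, None],
    "c": [1, 2, 3, 4],
})

# Default behaviour (how="any"): drop a row if any column is missing.
any_missing_dropped = toy.dropna()

# Restrict the check to a subset of columns and drop a row only when
# all of those columns are missing.
both_missing_dropped = toy.dropna(subset=["a", "b"], how="all")

print(len(any_missing_dropped), len(both_missing_dropped))
```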
# YOUR CODE HERE
raise NotImplementedError()
assert len(dfna1)>0
assert len(dfna2)>0
assert isinstance(dfna1, pd.DataFrame)
assert isinstance(dfna2, pd.DataFrame)
#DO NOT MODIFY/DELETE THIS CELL
23. Q6 Stratified sampling
Utilize the original dataframe ‘df’.
Set the y variable to the Survived column.
Set X to the SibSp, Parch, and Fare columns of the dataframe.
Create train_X, test_X, train_y, and test_y by doing a 50%/50% split with stratification by the Survived variable. Use random_state equal to 123.
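Stratification keeps the class proportions the same in both halves of a split. The sketch below shows this with scikit-learn's train_test_split on made-up data. The names toy, f1, f2, and label, and the random_state of 0, are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows of class 0 and 8 rows of class 1.
toy = pd.DataFrame({
    "f1": list(range(20)),
    "f2": list(range(20, 40)),
    "label": [0] * 12 + [1] * 8,
})

features = toy[["f1", "f2"]]   # a DataFrame of predictors
target = toy["label"]          # a Series holding the class label

# test_size=0.5 gives a 50/50 split; stratify=target preserves the
# 12:8 class ratio in both halves.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.5, stratify=target, random_state=0
)

print(t_train.value_counts())
print(t_test.value_counts())
```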
#Your answer
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)>0
assert len(test_X)>0
assert isinstance(train_X, pd.DataFrame)
assert isinstance(test_X, pd.DataFrame)
assert isinstance(train_y, pd.Series)
assert isinstance(test_y, pd.Series)
#DO NOT MODIFY/DELETE THIS CELL
23.1. Q7 Feature Manipulation
Consider the original dataframe df. Create a copy of it and call it dfage.
Remove all the rows of dfage in which the Age value is missing.
Then create a new column dfage['Age_st'] that holds, for each value of the ‘Age’ attribute, the corresponding standardized value rounded to 2 decimal places. You are not modifying dfage['Age'].
Hint: See https://en.wikipedia.org/wiki/Standard_score for the definition of a standardized value. You can use the np.mean() and np.std() functions to compute the mean and standard deviation.
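The standard score from the hint is z = (x - mean) / standard deviation. A minimal sketch on made-up numbers follows. The dataframe toy and the column weight are hypothetical, and the values are converted to a NumPy array first so that np.std clearly computes the population standard deviation (its default, ddof=0).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mean 5.0 and population std 2.0.
toy = pd.DataFrame({"weight": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

values = toy["weight"].to_numpy()
mu = np.mean(values)     # arithmetic mean
sigma = np.std(values)   # population standard deviation (ddof=0)

# Store the standardized values in a new column; the original
# column is left unchanged.
toy["weight_st"] = ((toy["weight"] - mu) / sigma).round(2)

print(toy)
```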
# YOUR CODE HERE
raise NotImplementedError()
assert len(dfage)>0
assert isinstance(dfage, pd.DataFrame)
assert 'Age_st' in dfage.columns
#DO NOT MODIFY/DELETE THIS CELL
23.2. Q8 Feature Creation 2
Create a copy of df called df8.
In df8, create a feature called stown using Embarked.
Where Embarked is ‘S’, set stown to 1.
Otherwise, set stown to 0.
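A 0/1 indicator feature can be derived from a comparison, as sketched below on made-up data. The names toy, color, and is_red are assumptions, and casting a boolean comparison to int is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy data, including one missing value.
toy = pd.DataFrame({"color": ["red", "blue", None, "red"]})

# The comparison yields True/False per row (missing values compare
# as False); astype(int) turns that into 1/0.
toy["is_red"] = (toy["color"] == "red").astype(int)

print(toy)
```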
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert set(df8['stown'])=={0,1}
assert df8.shape==(891, 13)
#DO NOT MODIFY/DELETE THIS CELL
24. Q9. Beautiful Soup
Parse the HTML content shared below with the ‘soup’ object and assign all the unique ‘Hometown’ values to a list object hts.
Strip any leading or trailing white space and convert each hometown value in hts to lowercase.
Print the final answer.
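As a rough sketch of reading the text that follows a labelled tag, the example below parses a different, made-up HTML fragment. The markup, the label “Color:”, and the names doc and colors are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, unrelated to the homework content.
html = """
<ul>
  <li><b>Color:</b> Red </li>
  <li><b>Color:</b> blue</li>
  <li><b>Size:</b> Large</li>
</ul>
"""

doc = BeautifulSoup(html, "html.parser")

colors = []
for tag in doc.find_all("b"):
    # Keep only the tags whose label matches.
    if tag.get_text() == "Color:":
        # next_sibling is the text node right after the closing tag.
        value = tag.next_sibling.strip().lower()
        # Skip values already collected so the list stays unique.
        if value not in colors:
            colors.append(value)

print(colors)
```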
import requests
from bs4 import BeautifulSoup
import operator
import pandas as pd
import json
newtext = """
<p>
<strong class="person1">YOB:</strong> 1990<br />
<strong class="person1">GENDER:</strong> FEMALE<br />
<strong class="person1">EYE COLOR:</strong> GREEN<br />
<strong class="person1">HAIR COLOR:</strong> BROWN<br />
<strong class="person1">GPA:</strong> 4<br />
<strong class="person1">Hometown:</strong> Johnstown<br />
</p>
<p>
<strong class="person2">YOB:</strong> 1993<br />
<strong class="person2">GENDER:</strong> FEMALE<br />
<strong class="person2">EYE COLOR:</strong> BROWN<br />
<strong class="person2">HAIR COLOR:</strong> BLACK<br />
<strong class="person2">GPA:</strong> 3.5<br />
<strong class="person2">Hometown:</strong> Pittston<br />
</p>
"""
hts=[]
soup = BeautifulSoup(newtext, "html.parser")  # name the parser explicitly to avoid a parser-guessing warning
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(hts)==2
assert hts==['johnstown', 'pittston']
25. Q10 String operations
Given a string str10, first split the string into words, strip any leading or trailing white space, and convert them to lowercase.
Now, using the ‘join’ operation we learned in class, concatenate these words into a new string str11 with a ‘-’ between each word.
For example, if str1 is ‘it is cold today’, then str2 will be ‘it-is-cold-today’.
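The split, clean, and join steps can be sketched on a different made-up string, here with an underscore as the separator. The names phrase, words, and joined are illustrative assumptions.

```python
# Hypothetical input with irregular spacing and mixed case.
phrase = "  Hello   Big World "

# split() with no arguments splits on any run of white space;
# strip() and lower() then normalise each word.
words = [w.strip().lower() for w in phrase.split()]

# The separator string's join method glues the words together.
joined = "_".join(words)

print(joined)
```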
str10 = 'Email the company at xyz@abcd.com and is the easiest way compared to tweet @abcdxyz'
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(str11)==83
#DO NOT MODIFY/DELETE THIS CELL
26. Q11. Regular Expressions
Create a function called extract_email that uses the regular expressions package to extract all the email IDs mentioned in a sentence.
For example:
extract_email('Email the company at jondow@franks.com today.')
should return a list: ['jondow@franks.com']
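The general pattern of wrapping re.findall in a function can be sketched with a different target, here hashtags rather than email addresses. The function name extract_hashtags and the pattern are assumptions chosen for illustration.

```python
import re

def extract_hashtags(text):
    """Return every '#word' token found in text, in order."""
    # re.findall returns a list of all non-overlapping matches;
    # \w+ matches one or more letters, digits, or underscores.
    return re.findall(r"#\w+", text)

tags = extract_hashtags("Loving #pandas and #regex today.")
print(tags)
```

When no match is found, re.findall returns an empty list, so the function always returns a list.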
#Your answer here
import re
str10 = 'Email the company at jondow@franks.com today.'
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(extract_email('Email the company at jondow@franks.com today.'))==1
assert extract_email('Email the company at jondow@franks.com today.')==['jondow@franks.com']
#DO NOT MODIFY/DELETE THIS CELL