16. Homework-3#

16.1. Total number of points: 70#

17. Due date: September 29, 2022#

Before you submit this homework, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You may discuss the homework with others, but all work must be your own.

This homework tests your knowledge of data manipulation, string manipulation, and feature preprocessing. The shared Python notebooks will be helpful for solving these problems.

Steps to evaluate your solutions:

Step-1: Try on Colab or Anaconda (Windows: https://docs.anaconda.com/anaconda/install/windows/ ; Mac: https://docs.anaconda.com/anaconda/install/mac-os/ ; Linux: https://docs.anaconda.com/anaconda/install/linux/)

Step-2: Launch the Anaconda console, then open Jupyter Notebook from it.

Step-3: Open the homework’s .ipynb file and write your solutions at the appropriate location “# YOUR CODE HERE”

Step-4: Restart the kernel and run all cells (in the menubar near the top of this window, select Cell → Run All).

Step-5: Go to “File”, click “Download as”, then click “Notebook (.ipynb)”. Please DO NOT change the file name, and keep the file as “.ipynb”.

Step-6: Go to lms.rpi.edu and upload your homework at the appropriate link to submit this homework.

18. Please note that for any question in this assignment you will receive points ONLY if your solution passes all the test cases, including the hidden test cases. So please make sure you think through all possible scenarios before submitting your answers.#

  • Note that hidden tests are present to ensure you are not hardcoding.

  • If caught cheating:

    • you will receive a score of 0 for the 1st violation.

    • for repeated incidents, you will receive an automatic ‘F’ grade and will be reported to the dean of the Lally School of Management.

Use the Titanic dataset from this URL ('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv'). The same dataset is used in several of the questions below.

19. Q1. Shape of a Data Frame.#

  1. Import pandas package as pd.

  2. Read the file from the ‘url’ and load the data into dataframe ‘df’ with default index.

  3. Set the variable rows to the number of rows and the variable cols to the number of columns.

  4. Print out the number of rows and columns, clearly labeling each.
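
As a minimal sketch of the pattern these steps describe, the snippet below reads a small CSV and unpacks its shape. The in-memory CSV text and the names toy, n_rows, and n_cols are hypothetical stand-ins chosen for illustration; they are not part of the assignment.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for a remote file.
csv_text = "name,score\nann,90\nbob,85\ncy,77\n"

# read_csv accepts a URL, a path, or a file-like object; the index defaults to 0..n-1.
toy = pd.read_csv(io.StringIO(csv_text))

# DataFrame.shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = toy.shape
print(f"rows: {n_rows}")
print(f"columns: {n_cols}")
```

Unpacking the tuple directly avoids indexing shape twice and keeps the two counts clearly named.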

url='https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv'

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert rows==891
assert df['Age'].sum()==21205.17
#DO NOT MODIFY/DELETE THIS CELL

Now we will be using the above dataframe df to do some preprocessing operations on this data. All the required libraries for further processing will be loaded here.

import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split

20. Q2 Dataframe Basic Analyses#

Determine how many first, second, and third class passengers there are, assigning the counts to the variables class1, class2, and class3.

Hint – Use value_counts operation
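
For reference, value_counts returns a Series indexed by the distinct values of a column, so individual counts can be looked up by label. The sketch below uses a hypothetical grade column rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: six records with a categorical "grade" column.
toy = pd.DataFrame({"grade": ["a", "b", "a", "c", "a", "b"]})

# value_counts() returns a Series whose index holds the distinct values.
counts = toy["grade"].value_counts()

# Look up each count by its label rather than by position,
# since the result is sorted by frequency, not by label.
n_a = counts["a"]
n_b = counts["b"]
n_c = counts["c"]
print(n_a, n_b, n_c)
```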

#Your answer here. 

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert class1>0
assert class2>0
assert class3>0
#DO NOT MODIFY/DELETE THIS CELL

21. Q3 Groupby#

Now use a groupby statement to calculate the mean age (use the ‘Age’ column) of passengers for each gender (the ‘Sex’ column).

Round each mean to 2 decimal places (for example, 3.14156 becomes 3.14) and assign the results to the variables female_age and male_age.
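
A minimal sketch of the groupby-then-round pattern, using hypothetical team and height columns. Note that mean skips missing values by default, and that selecting one entry from the resulting Series yields a NumPy floating-point scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one height is missing on purpose.
toy = pd.DataFrame({
    "team": ["x", "y", "x", "y", "x"],
    "height": [1.0, 2.0, 2.0, None, 2.0],
})

# Group by the categorical column, average the numeric one, round to 2 places.
means = toy.groupby("team")["height"].mean().round(2)

# Each lookup returns a NumPy float, not a plain Python float.
x_height = means["x"]
y_height = means["y"]
print(x_height, y_height, isinstance(x_height, np.floating))
```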

#Your answer 

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert female_age>0
assert isinstance(female_age, np.floating)
assert male_age>0 
assert isinstance(male_age, np.floating)
#DO NOT MODIFY/DELETE THIS CELL

22. Q4 Split Dataframe#

Now split the dataframe df into 3 different dataframes dfclass1, dfclass2, and dfclass3 using the Pclass variable.
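
One common way to split a dataframe by the value of a column is boolean-mask indexing, sketched here on hypothetical tier and price columns. Each result is itself a DataFrame that keeps the original row labels.

```python
import pandas as pd

# Hypothetical toy data with a three-level "tier" column.
toy = pd.DataFrame({"tier": [1, 2, 1, 3], "price": [10.0, 20.0, 30.0, 40.0]})

# A comparison produces a boolean Series; indexing with it keeps the True rows.
tier1 = toy[toy["tier"] == 1]
tier2 = toy[toy["tier"] == 2]
tier3 = toy[toy["tier"] == 3]
print(len(tier1), len(tier2), len(tier3))
```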

#Your answer

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(dfclass1)>0
assert len(dfclass2)>0
assert len(dfclass3)>0
assert isinstance(dfclass1, pd.DataFrame)
assert isinstance(dfclass2, pd.DataFrame)
assert isinstance(dfclass3, pd.DataFrame)
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL

22.1. Q5 Filter Missing Values#

Create a new dataframe dfna1 from df which removes all rows in which any of the variables are missing.

Create a new dataframe dfna2 from df which removes all rows in which both of the variables Age and Cabin are missing.
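
The difference between "any value missing" and "all of a chosen set missing" maps onto the how and subset parameters of dropna. The sketch below illustrates both on a hypothetical three-column frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in columns "a" and "b".
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["p", None, "r", None],
    "c": [1, 2, 3, 4],
})

# Default (how="any"): drop a row if any column is missing.
no_missing = toy.dropna()

# how="all" with subset: drop a row only if every listed column is missing.
both_missing_removed = toy.dropna(subset=["a", "b"], how="all")

print(len(no_missing), len(both_missing_removed))
```

Neither call modifies toy; dropna returns a new DataFrame unless inplace=True is passed.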

# YOUR CODE HERE
raise NotImplementedError()
assert len(dfna1)>0
assert len(dfna2)>0
assert isinstance(dfna1, pd.DataFrame)
assert isinstance(dfna2, pd.DataFrame)
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL

23. Q6 Stratified sampling#

Utilize the original dataframe ‘df’.

  1. Set the y variable to the Survived column.

  2. Set X to the SibSp, Parch, and Fare columns of the dataframe.

  3. Create train_X, test_X, train_y, test_y by doing a 50%/50% split with stratification by the Survived variable. Use random_state equal to 123.
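
Stratification keeps the class proportions of the label roughly equal in both halves. The sketch below shows this with train_test_split on a hypothetical eight-row frame; the column names and the random_state value here are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: six rows of class 0 and two rows of class 1.
toy = pd.DataFrame({
    "f1": range(8),
    "f2": range(8, 16),
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
X_demo = toy[["f1", "f2"]]   # double brackets keep a DataFrame
y_demo = toy["label"]        # single brackets give a Series

# stratify=y_demo asks for the same class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

The return order is train features, test features, train labels, test labels, so the unpacking order matters.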

#Your answer

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(train_X)>0
assert len(test_X)>0
assert isinstance(train_X, pd.DataFrame)
assert isinstance(test_X, pd.DataFrame)
assert isinstance(train_y, pd.Series)
assert isinstance(test_y, pd.Series)
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL

23.1. Q7 Feature Manipulation#

  1. Consider the original dataframe df. Create a copy of this and call it dfage.

  2. Remove all the rows of dfage in which the Age value is missing.

  3. Then create a new column dfage['Age_st'] which holds, for each value in the ‘Age’ column, the corresponding standardized value rounded to 2 decimal places. You are not modifying dfage['Age'].

Hint: See https://en.wikipedia.org/wiki/Standard_score for the definition of a standardized value. You can use the np.mean() and np.std() functions to compute the mean and standard deviation.
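
As a rough illustration of the standard score z = (x − mean) / std, the sketch below standardizes a hypothetical height column. It assumes the population standard deviation, which is what np.std computes by default (ddof=0); the pandas Series.std method defaults to the sample version instead, so the two can give slightly different results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"height": [1.0, 2.0, 3.0, 4.0, 5.0]})

values = toy["height"].to_numpy()
mu = np.mean(values)      # 3.0
sigma = np.std(values)    # population std (ddof=0), here sqrt(2)

# Write to a new column so the original values stay untouched.
toy["height_st"] = ((toy["height"] - mu) / sigma).round(2)
print(toy)
```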

# YOUR CODE HERE
raise NotImplementedError()
assert len(dfage)>0
assert isinstance(dfage, pd.DataFrame)
assert 'Age_st' in dfage.columns
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL
#DO NOT MODIFY/DELETE THIS CELL

23.2. Q8 Feature Creation 2#

  1. Create a copy of df called df8.

  2. In df8 create a feature called stown using Embarked.

    Where Embarked is ‘S’, make stown 1.

    Otherwise stown is 0.
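
A 0/1 indicator of this kind can be built by comparing a column to a value and casting the boolean result to integers. The sketch below uses a hypothetical port column; missing entries compare as not equal and therefore become 0.

```python
import pandas as pd

# Hypothetical toy data, including one missing value.
toy = pd.DataFrame({"port": ["S", "C", None, "S", "Q"]})

# The comparison yields True/False per row; astype(int) maps them to 1/0.
toy["is_s"] = (toy["port"] == "S").astype(int)
print(toy["is_s"].tolist())
```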

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert set(df8['stown'])=={0,1}
assert df8.shape==(891, 13)
#DO NOT MODIFY/DELETE THIS CELL

24. Q9. Beautiful Soup#

Use the HTML content shared below and parse it using the ‘soup’ object to assign all the unique ‘Hometown’ values to a list object hts.

Strip any leading or trailing white space and convert each hometown value in hts to lower case.

Print the final answer.
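
One possible approach to reading the text that follows a label tag is to walk the matching tags and take each one's next sibling. The sketch below does this on hypothetical HTML with a "City:" label; the markup, the variable names, and the choice of the built-in html.parser are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with label/value pairs similar in shape to a profile card.
html = """
<p><strong>Name:</strong> Ada <br /><strong>City:</strong>  Troy <br /></p>
<p><strong>Name:</strong> Bob <br /><strong>City:</strong> ALBANY<br /></p>
"""
soup_demo = BeautifulSoup(html, "html.parser")

cities = []
for tag in soup_demo.find_all("strong"):
    # Keep only the tags whose own text is the label of interest.
    if tag.get_text(strip=True) == "City:":
        # The value is the text node that immediately follows the tag.
        value = tag.next_sibling.strip().lower()
        if value not in cities:  # keep unique values, preserving order
            cities.append(value)
print(cities)
```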

import requests
from bs4 import BeautifulSoup
import operator
import pandas as pd
import json
newtext = """
<p>
    <strong class="person1">YOB:</strong> 1990<br />
    <strong class="person1">GENDER:</strong> FEMALE<br />
    <strong class="person1">EYE COLOR:</strong> GREEN<br />
    <strong class="person1">HAIR COLOR:</strong> BROWN<br />
    <strong class="person1">GPA:</strong> 4<br />
    <strong class="person1">Hometown:</strong> Johnstown<br />
</p>

<p>
    <strong class="person2">YOB:</strong> 1993<br />
    <strong class="person2">GENDER:</strong> FEMALE<br />
    <strong class="person2">EYE COLOR:</strong> BROWN<br />
    <strong class="person2">HAIR COLOR:</strong> BLACK<br />
    <strong class="person2">GPA:</strong> 3.5<br />
    <strong class="person2">Hometown:</strong> Pittston<br />
</p>

"""
hts=[]
soup = BeautifulSoup(newtext, "html.parser")  # name the parser explicitly to avoid a warning
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(hts)==2
assert hts==['johnstown', 'pittston']

25. Q10 String operations#

  1. Given a string str10, first split the string into words, strip any leading or trailing white space from each word, and convert the words to lowercase.

  2. Now, using the ‘join’ operation we learned in class, concatenate these words into a new string str11 with a ‘-’ between each.

For example: if str1 is ‘it is cold today’, then str2 will be ‘it-is-cold-today’.
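
The example above can be reproduced with a short sketch. The input text below is hypothetical and deliberately has irregular spacing and mixed case to show what split, strip, and lower each contribute.

```python
# Hypothetical input with extra spaces and mixed case.
text = "  It IS cold   Today "

# split() with no argument splits on any run of whitespace;
# strip() is then a harmless safeguard, and lower() normalizes case.
words = [w.strip().lower() for w in text.split()]

# join() is called on the separator and takes the list of pieces.
joined = "-".join(words)
print(joined)  # it-is-cold-today
```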

str10 = 'Email the company at xyz@abcd.com and is the easiest way compared to tweet @abcdxyz'
# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(str11)==83
#DO NOT MODIFY/DELETE THIS CELL

26. Q11. Regular Expressions#

Create a function called extract_email which uses the regular expressions package (re) to extract all the email addresses mentioned in a sentence.

For example:

extract_email('Email the company at jondow@franks.com today.')

Should return a list: ['jondow@franks.com']
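
As a rough sketch of how re.findall returns every match as a list, the snippet below uses a deliberately simplified address pattern. The pattern, the helper name find_emails, and the sample sentence are hypothetical; real email syntax is considerably more permissive than this expression.

```python
import re

# Simplified pattern (an assumption, not a full email grammar):
# a local part, "@", then a domain with at least one dot-separated label.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(sentence):
    """Return all substrings of sentence that match EMAIL_PATTERN."""
    # The group is non-capturing, so findall returns the whole matches.
    return EMAIL_PATTERN.findall(sentence)


found = find_emails("Write to a.b@example.org or c_d@mail.example.com.")
print(found)
```

Because each dot in the domain must be followed by at least one word character, the sentence-ending period is not swallowed into the last match.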

#Your answer here
import re
    
str10 = 'Email the company at jondow@franks.com today.'

# YOUR CODE HERE
raise NotImplementedError()
#DO NOT MODIFY/DELETE THIS CELL
assert len(extract_email('Email the company at jondow@franks.com today.'))==1
assert extract_email('Email the company at jondow@franks.com today.')==['jondow@franks.com']
#DO NOT MODIFY/DELETE THIS CELL

    
#DO NOT MODIFY/DELETE THIS CELL