Revisiting Boston Housing with PyTorch
rpi.analyticsdojo.com
73. Revisiting Boston Housing with PyTorch#
#!pip install torch torchvision
#Let's take care of some imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
#Define the model
import torch
import torch.nn as nn
import torch.nn.functional as F
73.1. Overview#
Getting the Data
Reviewing Data
Modeling
Model Evaluation
Using Model
Storing Model
73.2. Getting Data#
Available in the sklearn package as a Bunch object (dictionary).
Available in the UCI data repository.
Better to convert it to a Pandas DataFrame.
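Note: load_boston, used in the next cell, was deprecated and then removed in scikit-learn 1.2 over ethical concerns about the B feature. If it is unavailable in your environment, one workaround (assuming the dataset is still hosted on OpenML under this name) is:
from sklearn.datasets import fetch_openml
#Assumption: the Boston dataset remains available on OpenML as 'boston'.
boston_openml = fetch_openml(name='boston', version=1, as_frame=True)
boston_df_alt = boston_openml.frame  # already a pandas DataFrame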
#From sklearn tutorial.
from sklearn.datasets import load_boston
boston = load_boston()
print( "Type of boston dataset:", type(boston))
#A Bunch, as you may remember, is a dictionary-based dataset. Dictionaries are addressed by keys.
#Let's look at the keys.
print(boston.keys())
#DESCR sounds like it could be useful. Let's print the description.
print(boston['DESCR'])
# Let's change the data to a Panda's Dataframe
import pandas as pd
boston_df = pd.DataFrame(boston['data'] )
boston_df.head()
#Now add the column names.
boston_df.columns = boston['feature_names']
boston_df.head()
#Add the target as PRICE.
boston_df['PRICE']= boston['target']
boston_df.head()
73.3. Attribute Information (in order):#
It looks like the independent variables are all continuous (CHAS is a 0/1 dummy), and the dependent variable is continuous as well.
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000's
Let's check for missing values.
import numpy as np
#check for missing values
print(np.sum(np.isnan(boston_df)))
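pandas offers an equivalent check, shown here for comparison:
#Equivalent per-column count of missing values using pandas.
print(boston_df.isnull().sum())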
73.4. What type of data are there?#
First let’s focus on the dependent variable, as the nature of the DV is critical to selection of model.
Median value of owner-occupied homes in $1000’s is the Dependent Variable (continuous variable).
It is relevant to look at the distribution of the dependent variable, so let’s do that first.
The distribution is roughly normal, with a cluster at the top end that we could explore later.
#Let's use seaborn, because it is pretty. ;)
#See more here. http://seaborn.pydata.org/tutorial/distributions.html
import seaborn as sns
#distplot was deprecated in recent seaborn releases; histplot with kde=True is the modern equivalent.
sns.histplot(boston_df['PRICE'], kde=True);
#We can quickly look at other data.
#Look at the bottom row to see things likely correlated with price.
#Look along the diagonal to see histograms of each.
sns.pairplot(boston_df);
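The pairplot's bottom row hints at which features track PRICE; the DataFrame's corr method gives a numeric companion view (a small sketch):
#Correlation of each feature with PRICE, sorted from most positive to most negative.
print(boston_df.corr()['PRICE'].sort_values(ascending=False))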
73.5. Preparing to Model#
It is common to separate y as the dependent variable and X as the matrix of independent variables.
Here we are using train_test_split to split the data into test and train sets.
This creates 4 subsets, with IV and DV separated: X_train, X_test, y_train, y_test
#This will throw an error at import if you haven't upgraded scikit-learn.
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
#y is the dependent variable.
y = boston_df['PRICE']
#As we know, iloc is used to slice the array by index number. Here this is the matrix of
#independent variables.
X = boston_df.iloc[:,0:13]
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
#Define training hyperparameters.
batch_size = 50
num_epochs = 200
learning_rate = 0.01
size_hidden= 100
#Calculate some other hyperparameters based on data.
batch_no = len(X_train) // batch_size #batches
cols=X_train.shape[1] #Number of columns in input matrix
n_output=1
#Create the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# If we are on a CUDA machine, this should print a CUDA device:
print("Executing the model on :",device)
class Net(torch.nn.Module):
    def __init__(self, n_feature, size_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, size_hidden)   # hidden layer
        self.predict = torch.nn.Linear(size_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))   # activation function for hidden layer
        x = self.predict(x)          # linear output
        return x
net = Net(cols, size_hidden, n_output)
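# Note (sketch only, not applied in this notebook): the `device` selected above
# is never actually used, so everything runs on the CPU. GPU execution would
# require moving the model and each batch to the device, e.g.:
#   net = net.to(device)
#   inputs = torch.FloatTensor(X_train[start:end]).to(device)
#   labels = torch.FloatTensor(y_train[start:end]).to(device)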
#Adam is a specific flavor of gradient descent that typically performs better.
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
criterion = torch.nn.MSELoss(reduction='sum')  # summed squared error for regression (size_average=False is deprecated)
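Because the loss uses reduction='sum', the printed loss is the sum of squared errors over each batch rather than the mean, so its magnitude depends on the batch size. A tiny illustration:
#Illustration: 'sum' adds squared errors, 'mean' averages them.
pred = torch.tensor([[2.0], [4.0]])
target = torch.tensor([[1.0], [1.0]])
print(torch.nn.MSELoss(reduction='sum')(pred, target))   # tensor(10.) = 1 + 9
print(torch.nn.MSELoss(reduction='mean')(pred, target))  # tensor(5.)  = 10 / 2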
#Change to numpy arrays.
X_train=X_train.values
y_train=y_train.values
X_test=X_test.values
y_test=y_test.values
from sklearn.utils import shuffle
from torch.autograd import Variable  # Variable is deprecated in modern PyTorch; plain tensors work the same way
running_loss = 0.0
for epoch in range(num_epochs):
    #Shuffle just mixes up the dataset between epochs
    X_train, y_train = shuffle(X_train, y_train)
    # Mini batch learning
    for i in range(batch_no):
        start = i * batch_size
        end = start + batch_size
        inputs = Variable(torch.FloatTensor(X_train[start:end]))
        labels = Variable(torch.FloatTensor(y_train[start:end]))
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, torch.unsqueeze(labels, dim=1))
        loss.backward()
        optimizer.step()
        # accumulate statistics
        running_loss += loss.item()
    print('Epoch {}'.format(epoch + 1), "loss: ", running_loss)
    running_loss = 0.0
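As an aside, inference does not need gradient tracking, so wrapping evaluation in torch.no_grad() is the idiomatic form. A minimal sketch (the evaluation cells below work either way):
#Idiomatic inference: disable gradient tracking during evaluation.
with torch.no_grad():
    train_preds = net(torch.FloatTensor(X_train)).numpy()[:, 0]
print(train_preds[:5])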
import pandas as pd
from sklearn.metrics import r2_score
X = Variable(torch.FloatTensor(X_train))
result = net(X)
pred = result.data[:, 0].numpy()
print(len(pred), len(y_train))
#r2_score expects (y_true, y_pred), so the true values go first.
r2_score(y_train, pred)
import pandas as pd
from sklearn.metrics import r2_score
#It is a little tricky to extract the resulting predictions, so we wrap it in a function.
def calculate_r2(x, y=[]):
    """
    This function will return the r2 if passed x and y, or return predictions if just passed x.
    """
    # Evaluate the model with the given set.
    X = Variable(torch.FloatTensor(x))
    result = net(X)  # This outputs the value for regression
    result = result.data[:, 0].numpy()
    if len(y) != 0:
        # r2_score expects (y_true, y_pred), so the true values go first.
        r2 = r2_score(y, result)
        print("R-Squared", r2)
        return pd.DataFrame(data={'actual': y, 'predicted': result})
    else:
        print("returning predictions")
        return result
result1=calculate_r2(X_train,y_train)
result2=calculate_r2(X_test,y_test)
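The function's other branch returns raw predictions when no y is passed, for example:
#Passing only X returns the raw predictions.
preds = calculate_r2(X_test)
print(preds[:5])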
73.6. Modeling#
First import the package:
from sklearn.linear_model import LinearRegression
Then create the model object.
Then fit the data.
This creates a trained model (an object) of the LinearRegression class.
The variety of methods and attributes available for regression are shown here.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit( X_train, y_train )
73.7. Evaluating the Model Results#
You have fit a model.
You can now store this model, save the object to disk, or evaluate it with different outcomes.
Trained regression objects have coefficients (coef_) and intercepts (intercept_) as attributes.
R-Squared is determined from the score method of the regression object.
For regression, we are going to use the coefficient of determination as our way of evaluating the results, also referred to as R-Squared.
print('R2 for Train', lm.score(X_train, y_train))
print('R2 for Test (cross validation)', lm.score(X_test, y_test))
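The overview also listed storing the model. A minimal sketch of persisting both trained models (the filenames here are hypothetical):
import joblib
#Persist the sklearn model (hypothetical filename).
joblib.dump(lm, 'lm_boston.joblib')
#Persist the PyTorch network's learned weights (hypothetical filename).
torch.save(net.state_dict(), 'net_boston.pt')
#Reloading later:
lm2 = joblib.load('lm_boston.joblib')
net.load_state_dict(torch.load('net_boston.pt'))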
Copyright AnalyticsDojo 2016. This work is licensed under the Creative Commons Attribution 4.0 International license agreement.