37. Regression with Stats-Models#
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
37.1. Scikit-learn vs Stats-Models#
Scikit-learn provides a framework that enables a similar API (way of interacting with the codebase) for many different types of machine learning (i.e., predictive) models.
Stats-Models provides a clear set of results for the statistical analyses (understanding relationships) common to scientific (i.e., explanatory) models.
#Get a sample dataset
df = sm.datasets.get_rdataset("Guerry", "HistData").data
37.2. About the Data#
See link.
André-Michel Guerry (1833) was the first to systematically collect and analyze social data on such things as crime, literacy, and suicide, with the view of determining social laws and the relations among these variables.
df.columns
Index(['dept', 'Region', 'Department', 'Crime_pers', 'Crime_prop', 'Literacy',
'Donations', 'Infants', 'Suicides', 'MainCity', 'Wealth', 'Commerce',
'Clergy', 'Crime_parents', 'Infanticide', 'Donation_clergy', 'Lottery',
'Desertion', 'Instruction', 'Prostitutes', 'Distance', 'Area',
'Pop1831'],
dtype='object')
df
| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | Wealth | Commerce | Clergy | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | 73 | 58 | 11 | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
1 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | 22 | 10 | 82 | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.00 |
2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | 61 | 66 | 68 | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | 76 | 49 | 5 | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.90 |
4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | 83 | 65 | 10 | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.280 | 5549 | 129.10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | 86 | W | Vienne | 15010 | 4710 | 25 | 8922 | 35224 | 21851 | 2:Med | 68 | 43 | 71 | 20 | 1 | 44 | 40 | 38 | 65 | 18 | 170.523 | 6990 | 282.73 |
82 | 87 | C | Haute-Vienne | 16256 | 6402 | 13 | 13817 | 19940 | 33497 | 2:Med | 67 | 63 | 76 | 68 | 6 | 78 | 55 | 11 | 84 | 7 | 198.874 | 5520 | 285.13 |
83 | 88 | E | Vosges | 18835 | 9044 | 62 | 4040 | 14978 | 33029 | 2:Med | 82 | 42 | 51 | 58 | 34 | 5 | 14 | 85 | 11 | 43 | 174.477 | 5874 | 397.99 |
84 | 89 | C | Yonne | 18006 | 6516 | 47 | 4276 | 16616 | 12789 | 2:Med | 30 | 15 | 55 | 32 | 22 | 35 | 51 | 66 | 27 | 272 | 81.797 | 7427 | 352.49 |
85 | 200 | NaN | Corse | 2199 | 4589 | 49 | 37015 | 24743 | 37016 | 2:Med | 37 | 83 | 1 | 81 | 2 | 84 | 83 | 9 | 25 | 1 | 539.213 | 8680 | 195.41 |
86 rows × 23 columns
37.3. Predicting Gambling#
Lottery
Per capita wager on the Royal Lottery. Ranked ratio of the proceeds bet on the royal lottery to population, averaged for the years 1822-1826. (Compte rendus par le ministre des finances)

Literacy
Percent Read & Write: percent of military conscripts who can read and write.

Pop1831
Population in 1831, taken from Angeville (1836), Essai sur la Statistique de la Population française, in 1000s.
# Notice this uses an R-style formula for the analysis
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=df).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Lottery R-squared: 0.348
Model: OLS Adj. R-squared: 0.333
Method: Least Squares F-statistic: 22.20
Date: Mon, 04 Oct 2021 Prob (F-statistic): 1.90e-08
Time: 18:55:31 Log-Likelihood: -379.82
No. Observations: 86 AIC: 765.6
Df Residuals: 83 BIC: 773.0
Df Model: 2
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 246.4341 35.233 6.995 0.000 176.358 316.510
Literacy -0.4889 0.128 -3.832 0.000 -0.743 -0.235
np.log(Pop1831) -31.3114 5.977 -5.239 0.000 -43.199 -19.424
==============================================================================
Omnibus: 3.713 Durbin-Watson: 2.019
Prob(Omnibus): 0.156 Jarque-Bera (JB): 3.394
Skew: -0.487 Prob(JB): 0.183
Kurtosis: 3.003 Cond. No. 702.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
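Rather than parsing the printed summary, the individual quantities can be pulled directly off the fitted results object as attributes. A minimal sketch, using synthetic data with the same column names so the example does not depend on downloading the Guerry dataset:

```python
# Sketch: accessing fitted quantities on a statsmodels results object.
# The data here is synthetic (same column names as Guerry, made-up values).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Literacy": rng.uniform(10, 80, size=86),
    "Pop1831": rng.uniform(100, 600, size=86),
})
df["Lottery"] = (250 - 0.5 * df["Literacy"]
                 - 30 * np.log(df["Pop1831"])
                 + rng.normal(scale=20, size=86))

results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=df).fit()

print(results.params)      # coefficients as a labeled Series
print(results.rsquared)    # R-squared
print(results.pvalues)     # per-coefficient p-values
print(results.conf_int())  # 95% confidence intervals
```

These attributes are convenient when coefficients need to feed into later computations or plots.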
37.4. Alternate Syntax#
This is a more explicit way of specifying the same model: build the design matrix yourself with pandas and pass it directly to `sm.OLS`.
df['ln_Pop1831']=np.log(df['Pop1831'])
X = df[['Literacy', 'ln_Pop1831']] # here we have 2 variables for the multiple linear regression.
Y = df['Lottery']
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: Lottery R-squared: 0.348
Model: OLS Adj. R-squared: 0.333
Method: Least Squares F-statistic: 22.20
Date: Mon, 04 Oct 2021 Prob (F-statistic): 1.90e-08
Time: 18:55:32 Log-Likelihood: -379.82
No. Observations: 86 AIC: 765.6
Df Residuals: 83 BIC: 773.0
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 246.4341 35.233 6.995 0.000 176.358 316.510
Literacy -0.4889 0.128 -3.832 0.000 -0.743 -0.235
ln_Pop1831 -31.3114 5.977 -5.239 0.000 -43.199 -19.424
==============================================================================
Omnibus: 3.713 Durbin-Watson: 2.019
Prob(Omnibus): 0.156 Jarque-Bera (JB): 3.394
Skew: -0.487 Prob(JB): 0.183
Kurtosis: 3.003 Cond. No. 702.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
37.4.1. Use Stargazer#
Show multiple different models side by side using Stargazer, a Python implementation of the R package for producing well-formatted regression tables.
Let’s add a different model.
!pip install Stargazer
Requirement already satisfied: Stargazer in /usr/local/lib/python3.7/dist-packages (0.0.5)
from stargazer.stargazer import Stargazer, LineLocation
X2 = df[['Literacy', 'ln_Pop1831','Crime_pers', 'Crime_prop']]
X2 = sm.add_constant(X2) # adding a constant
model2 = sm.OLS(Y, X2).fit()
stargazer = Stargazer([model,model2])
stargazer
Dependent variable: Lottery

| | (1) | (2) |
|---|---|---|
| Crime_pers | | 0.000 |
| | | (0.000) |
| Crime_prop | | 0.002** |
| | | (0.001) |
| Literacy | -0.489*** | -0.365*** |
| | (0.128) | (0.133) |
| const | 246.434*** | 215.240*** |
| | (35.233) | (38.890) |
| ln_Pop1831 | -31.311*** | -30.191*** |
| | (5.977) | (6.441) |
| Observations | 86 | 86 |
| R2 | 0.348 | 0.409 |
| Adjusted R2 | 0.333 | 0.380 |
| Residual Std. Error | 20.397 (df=83) | 19.667 (df=81) |
| F Statistic | 22.196*** (df=2; 83) | 14.003*** (df=4; 81) |

Note: *p<0.1; **p<0.05; ***p<0.01
37.5. Challenge: Compare Results#
Explore another model of Lottery and add it to the Stargazer results.

Explore the Stargazer documentation and customize the order of the variables, putting the constant first, followed by the variables that appear in all models (as is typically done).