AnalyticsDojo

Regression with Stats-Models

rpi.analyticsdojo.com

37. Regression with Stats-Models#

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

37.1. Scikit-learn vs Stats-Models#

  • Scikit-learn provides a framework that enables a similar API (way of interacting with the codebase) for many different types of machine learning (i.e., predictive) models.

  • Stats-Models provides a clear set of results for statistical analyses (understanding relationships) common to scientific (i.e., explanatory) models.

#Get a sample dataset
df = sm.datasets.get_rdataset("Guerry", "HistData").data

37.2. About the Data#

  • See link.

  • André-Michel Guerry (1833) was the first to systematically collect and analyze social data on such things as crime, literacy, and suicide with the view to determining social laws and the relations among these variables.

df.columns
Index(['dept', 'Region', 'Department', 'Crime_pers', 'Crime_prop', 'Literacy',
       'Donations', 'Infants', 'Suicides', 'MainCity', 'Wealth', 'Commerce',
       'Clergy', 'Crime_parents', 'Infanticide', 'Donation_clergy', 'Lottery',
       'Desertion', 'Instruction', 'Prostitutes', 'Distance', 'Area',
       'Pop1831'],
      dtype='object')
df
dept Region Department Crime_pers Crime_prop Literacy Donations Infants Suicides MainCity Wealth Commerce Clergy Crime_parents Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes Distance Area Pop1831
0 1 E Ain 28870 15890 37 5098 33120 35039 2:Med 73 58 11 71 60 69 41 55 46 13 218.372 5762 346.03
1 2 N Aisne 26226 5521 51 8901 14572 12831 2:Med 22 10 82 4 82 36 38 82 24 327 65.945 7369 513.00
2 3 C Allier 26747 7925 13 10973 17044 114121 2:Med 61 66 68 46 42 76 66 16 85 34 161.927 7340 298.26
3 4 E Basses-Alpes 12935 7289 46 2733 23018 14238 1:Sm 76 49 5 70 12 37 80 32 29 2 351.399 6925 155.90
4 5 E Hautes-Alpes 17488 8174 69 6962 23076 16171 1:Sm 83 65 10 22 23 64 79 35 7 1 320.280 5549 129.10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 86 W Vienne 15010 4710 25 8922 35224 21851 2:Med 68 43 71 20 1 44 40 38 65 18 170.523 6990 282.73
82 87 C Haute-Vienne 16256 6402 13 13817 19940 33497 2:Med 67 63 76 68 6 78 55 11 84 7 198.874 5520 285.13
83 88 E Vosges 18835 9044 62 4040 14978 33029 2:Med 82 42 51 58 34 5 14 85 11 43 174.477 5874 397.99
84 89 C Yonne 18006 6516 47 4276 16616 12789 2:Med 30 15 55 32 22 35 51 66 27 272 81.797 7427 352.49
85 200 NaN Corse 2199 4589 49 37015 24743 37016 2:Med 37 83 1 81 2 84 83 9 25 1 539.213 8680 195.41

86 rows × 23 columns

37.3. Predicting Gambling#

  • Lottery Per capita wager on Royal Lottery. Ranked ratio of the proceeds bet on the royal lottery to population— Average for the years 1822-1826. (Compte rendus par le ministre des finances)

  • Literacy Percent Read & Write: Percent of military conscripts who can read and write.

  • Pop1831 Population in 1831, taken from Angeville (1836), Essai sur la Statistique de la Population française, in 1000s.

#Notice this is an R-style formula analysis
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Mon, 04 Oct 2021   Prob (F-statistic):           1.90e-08
Time:                        18:55:31   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233      6.995      0.000     176.358     316.510
Literacy           -0.4889      0.128     -3.832      0.000      -0.743      -0.235
np.log(Pop1831)   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
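The summary above is printed text, but the fitted results object also exposes every number programmatically. A small sketch on synthetic data (names here are illustrative, not from the Guerry dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.normal(size=200)})
toy['y'] = 3.0 - 2.0 * toy['x'] + rng.normal(scale=0.5, size=200)

fit = smf.ols('y ~ x', data=toy).fit()
print(fit.params['x'])          # coefficient
print(fit.pvalues['x'])         # p-value
print(fit.rsquared)             # R-squared
print(fit.conf_int().loc['x'])  # 95% confidence interval
```

Pulling values this way is more reliable than parsing the printed summary when a coefficient or R-squared is needed downstream.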

37.4. Alternate Syntax#

This is a more explicit way of specifying the model: build the design matrix and the target yourself, then pass them directly to sm.OLS.

df['ln_Pop1831']=np.log(df['Pop1831'])
X = df[['Literacy', 'ln_Pop1831']] # here we have 2 variables for the multiple linear regression. 
Y = df['Lottery']

X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
print_model = model.summary()
print(print_model)
    
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Mon, 04 Oct 2021   Prob (F-statistic):           1.90e-08
Time:                        18:55:32   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        246.4341     35.233      6.995      0.000     176.358     316.510
Literacy      -0.4889      0.128     -3.832      0.000      -0.743      -0.235
ln_Pop1831   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

37.4.1. Use Stargazer#

Show multiple different models easily using Stargazer, a Python implementation of the R package of the same name for producing well-formatted regression tables.

Let’s add a different model.

!pip install Stargazer
Requirement already satisfied: Stargazer in /usr/local/lib/python3.7/dist-packages (0.0.5)
from stargazer.stargazer import Stargazer, LineLocation

X2 = df[['Literacy', 'ln_Pop1831', 'Crime_pers', 'Crime_prop']]
X2 = sm.add_constant(X2) # adding a constant
model2 = sm.OLS(Y, X2).fit()

stargazer = Stargazer([model,model2])
stargazer
                     Dependent variable: Lottery
                     ---------------------------
                        (1)                  (2)
Crime_pers                                   0.000
                                             (0.000)
Crime_prop                                   0.002**
                                             (0.001)
Literacy             -0.489***              -0.365***
                     (0.128)                (0.133)
const                246.434***             215.240***
                     (35.233)               (38.890)
ln_Pop1831           -31.311***             -30.191***
                     (5.977)                (6.441)
Observations         86                     86
R2                   0.348                  0.409
Adjusted R2          0.333                  0.380
Residual Std. Error  20.397 (df=83)         19.667 (df=81)
F Statistic          22.196*** (df=2; 83)   14.003*** (df=4; 81)
Note: *p<0.1; **p<0.05; ***p<0.01

37.5. Challenge: Compare Results#

  • Explore another model of Lottery and add it to the Stargazer results.

  • Explore the stargazer documentation and customize the order of the variables, putting the constant and then the variables in all models on top (as is typically done).