AnalyticsDojo

Regression with Stats-Models

rpi.analyticsdojo.com

37. Regression with Stats-Models#

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

37.1. Scikit-learn vs Stats-Models#

  • Scikit-learn provides a framework that enables a similar API (way of interacting with the codebase) for many different types of machine learning (i.e., predictive) models.

  • Stats-Models provides a clear set of results for statistical analyses (understanding relationships) common to scientific (i.e., explanatory) models.

#Get a sample dataset
df = sm.datasets.get_rdataset("Guerry", "HistData").data

37.2. About the Data#

  • See link.

  • André-Michel Guerry (1833) was the first to systematically collect and analyze social data on such things as crime, literacy, and suicide with the view to determining social laws and the relations among these variables.

df.columns
Index(['dept', 'Region', 'Department', 'Crime_pers', 'Crime_prop', 'Literacy',
       'Donations', 'Infants', 'Suicides', 'MainCity', 'Wealth', 'Commerce',
       'Clergy', 'Crime_parents', 'Infanticide', 'Donation_clergy', 'Lottery',
       'Desertion', 'Instruction', 'Prostitutes', 'Distance', 'Area',
       'Pop1831'],
      dtype='object')
df
dept Region Department Crime_pers Crime_prop Literacy Donations Infants Suicides MainCity Wealth Commerce Clergy Crime_parents Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes Distance Area Pop1831
0 1 E Ain 28870 15890 37 5098 33120 35039 2:Med 73 58 11 71 60 69 41 55 46 13 218.372 5762 346.03
1 2 N Aisne 26226 5521 51 8901 14572 12831 2:Med 22 10 82 4 82 36 38 82 24 327 65.945 7369 513.00
2 3 C Allier 26747 7925 13 10973 17044 114121 2:Med 61 66 68 46 42 76 66 16 85 34 161.927 7340 298.26
3 4 E Basses-Alpes 12935 7289 46 2733 23018 14238 1:Sm 76 49 5 70 12 37 80 32 29 2 351.399 6925 155.90
4 5 E Hautes-Alpes 17488 8174 69 6962 23076 16171 1:Sm 83 65 10 22 23 64 79 35 7 1 320.280 5549 129.10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 86 W Vienne 15010 4710 25 8922 35224 21851 2:Med 68 43 71 20 1 44 40 38 65 18 170.523 6990 282.73
82 87 C Haute-Vienne 16256 6402 13 13817 19940 33497 2:Med 67 63 76 68 6 78 55 11 84 7 198.874 5520 285.13
83 88 E Vosges 18835 9044 62 4040 14978 33029 2:Med 82 42 51 58 34 5 14 85 11 43 174.477 5874 397.99
84 89 C Yonne 18006 6516 47 4276 16616 12789 2:Med 30 15 55 32 22 35 51 66 27 272 81.797 7427 352.49
85 200 NaN Corse 2199 4589 49 37015 24743 37016 2:Med 37 83 1 81 2 84 83 9 25 1 539.213 8680 195.41

86 rows × 23 columns

37.3. Predicting Gambling#

  • Lottery Per capita wager on Royal Lottery. Ranked ratio of the proceeds bet on the royal lottery to population— Average for the years 1822-1826. (Compte rendus par le ministre des finances)

  • Literacy Percent Read & Write: Percent of military conscripts who can read and write.

  • Pop1831 Population in 1831, taken from Angeville (1836), Essai sur la Statistique de la Population française, in 1000s.

#Notice this is an R-style formula analysis
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Mon, 04 Oct 2021   Prob (F-statistic):           1.90e-08
Time:                        18:55:31   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233      6.995      0.000     176.358     316.510
Literacy           -0.4889      0.128     -3.832      0.000      -0.743      -0.235
np.log(Pop1831)   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
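The summary above is printed text, but the fitted results object also exposes every number programmatically. A small sketch on synthetic data (names here are illustrative, not from the Guerry dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.normal(size=200)})
toy['y'] = 3.0 - 2.0 * toy['x'] + rng.normal(scale=0.5, size=200)

fit = smf.ols('y ~ x', data=toy).fit()
print(fit.params['x'])          # coefficient
print(fit.pvalues['x'])         # p-value
print(fit.rsquared)             # R-squared
print(fit.conf_int().loc['x'])  # 95% confidence interval
```

Pulling values this way is more reliable than parsing the printed summary when a coefficient or R-squared is needed downstream.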

37.4. Alternate Syntax#

This is a more explicit way of specifying the model: build the design matrix and the target yourself, then pass them directly to sm.OLS.

df['ln_Pop1831']=np.log(df['Pop1831'])
X = df[['Literacy', 'ln_Pop1831']] # here we have 2 variables for the multiple linear regression. 
Y = df['Lottery']

X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
print_model = model.summary()
print(print_model)
    
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Mon, 04 Oct 2021   Prob (F-statistic):           1.90e-08
Time:                        18:55:32   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        246.4341     35.233      6.995      0.000     176.358     316.510
Literacy      -0.4889      0.128     -3.832      0.000      -0.743      -0.235
ln_Pop1831   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

37.4.1. Use Stargazer#

Show multiple different models easily using Stargazer, a Python implementation of the R package of the same name for producing well-formatted regression tables.

Let’s add a different model.

!pip install Stargazer
Requirement already satisfied: Stargazer in /usr/local/lib/python3.7/dist-packages (0.0.5)
from stargazer.stargazer import Stargazer, LineLocation

X2 = df[['Literacy', 'ln_Pop1831', 'Crime_pers', 'Crime_prop']]
X2 = sm.add_constant(X2) # adding a constant
model2 = sm.OLS(Y, X2).fit()

stargazer = Stargazer([model,model2])
stargazer
                     Dependent variable: Lottery
                     ---------------------------
                        (1)                  (2)
Crime_pers                                   0.000
                                             (0.000)
Crime_prop                                   0.002**
                                             (0.001)
Literacy             -0.489***              -0.365***
                     (0.128)                (0.133)
const                246.434***             215.240***
                     (35.233)               (38.890)
ln_Pop1831           -31.311***             -30.191***
                     (5.977)                (6.441)
Observations         86                     86
R2                   0.348                  0.409
Adjusted R2          0.333                  0.380
Residual Std. Error  20.397 (df=83)         19.667 (df=81)
F Statistic          22.196*** (df=2; 83)   14.003*** (df=4; 81)
Note: *p<0.1; **p<0.05; ***p<0.01

37.5. Challenge: Compare Results#

  • Explore another model of Lottery and add it to the Stargazer results.

  • Explore the stargazer documentation and customize the order of the variables, putting the constant and then the variables in all models on top (as is typically done).