In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: from scipy import stats
In [4]: from matplotlib import pyplot as plt
Example for using GLM on binomial response data. The input response vector in this case is N by 2 (success, failure). This data is taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.
The response variable is (# of students above the math national median, # of students below).
In [5]: data = sm.datasets.star98.load()
In [6]: data.exog = sm.add_constant(data.exog)
The response variable is (success, failure). E.g., the first observation is:
In [7]: print data.endog[0]
[ 452. 355.]
The total number of trials for this observation is data.endog[0].sum() = 807.
In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
In [9]: binom_results = glm_binom.fit()
The parameter estimates are:
In [10]: print binom_results.params
[-0.01681504 0.00992548 -0.01872421 -0.01423856 0.25448717 0.24069366
0.08040867 -1.9521605 -0.33408647 -0.16902217 0.0049167 -0.00357996
-0.01407656 -0.00400499 -0.0039064 0.0917143 0.04898984 0.00804074
0.00022201 -0.00224925 2.95887793]
The corresponding t-values are
In [11]: print binom_results.tvalues
[-38.74908321 16.50473627 -25.1821894 -32.81791308 8.49827113
4.21247925 5.7749976 -6.16191078 -5.45321673 -5.16865445
3.92119964 -15.87825999 -7.39093058 -8.44963886 -4.05916246
6.3210987 6.57434662 5.36229044 7.42806363 -6.44513698
1.91301155]
It is common in GLMs with interactions to compare first differences: the change in the expected response as one explanatory variable moves between two values while the other variables are held constant. This example uses the interquartile difference for the percentage of low-income households, holding the other explanatory variables at their means.
In [12]: means = data.exog.mean(axis=0)
In [13]: means25 = means.copy()
In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)
In [15]: means75 = means.copy()
In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)
In [17]: resp_25 = binom_results.predict(means25)
In [18]: resp_75 = binom_results.predict(means75)
In [19]: diff = resp_75 - resp_25
The interquartile first difference for the percentage of low-income households in a school district, in percentage points, is:
In [20]: print diff*100
-11.8752509918
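As a sanity check, the same first difference can be reproduced by hand. This is a minimal sketch, assuming only the logit link reported in the summary below: predict() applies the inverse link (the logistic function) to the linear predictor.
from scipy.special import expit  # inverse logit: 1/(1 + exp(-z))
resp_25_manual = expit(np.dot(means25, binom_results.params))
resp_75_manual = expit(np.dot(means75, binom_results.params))
print (resp_75_manual - resp_25_manual)*100  # should reproduce -11.8753 above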
In [21]: means0 = means.copy()
In [22]: means100 = means.copy()
In [23]: means0[0] = data.exog[:,0].min()
In [24]: means100[0] = data.exog[:,0].max()
In [25]: resp_0 = binom_results.predict(means0)
In [26]: resp_100 = binom_results.predict(means100)
In [27]: diff_full = resp_100 - resp_0
In [28]: print """The full range difference is %2.4f %%""" % (diff_full*100)
The full range difference is -36.1636 %
In [29]: nobs = binom_results.nobs
In [30]: y = data.endog[:,0]/data.endog.sum(1)
In [31]: yhat = binom_results.mu
Plot of yhat (the fitted mean success proportions, binom_results.mu) vs. the observed proportions y:
In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0xd29534c>
In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.PathCollection at 0xd2c7f2c>
In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params
In [35]: fit = lambda x: line_fit[1] + line_fit[0]*x  # add_constant appends the constant last, so line_fit[1] is the intercept and line_fit[0] the slope
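For a simple two-variable fit like this, scipy also offers stats.linregress; a sketch (its slope and intercept should match line_fit up to numerical noise):
slope, intercept, r_value, p_value, stderr = stats.linregress(yhat, y)
fit_scipy = lambda x: intercept + slope*x  # equivalent to fit above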
In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0xcd28f8c>]
In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0xcb9aaac>
In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0xd11c68c>
In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0xd19a8ec>
Plot of yhat vs. Pearson residuals
In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0xcb0b0ec>
In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.PathCollection at 0xd1f28ec>
In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0xd1f268c>]
In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0xce56d2c>
In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0xd1fa20c>
In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0xd0b9bec>
Histogram of standardized deviance residuals
In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0xcf9130c>
In [47]: res = binom_results.resid_deviance.copy()
In [48]: stdres = (res - res.mean())/res.std()
In [49]: plt.hist(stdres, bins=25)
Out[49]:
(array([ 3, 2, 4, 11, 11, 20, 27, 31, 33, 41, 28, 26, 23, 17, 11, 4, 2,
3, 1, 2, 1, 0, 1, 0, 1]),
array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
-1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
0.18069979, 0.45305454, 0.72540929, 0.99776403, 1.27011878,
1.54247353, 1.81482827, 2.08718302, 2.35953777, 2.63189252,
2.90424726, 3.17660201, 3.44895676, 3.7213115 , 3.99366625,
4.266021 ]),
<a list of 25 Patch objects>)
In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0xd12720c>
QQ Plot of Deviance Residuals
In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0xcf8fd8c>
In [52]: res.sort()
In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)
In [54]: quants = np.zeros_like(res)
In [55]: for i in range(nobs):
....: quants[i] = stats.scoreatpercentile(res, p[i]*100)
....:
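The loop above can be vectorized; np.percentile accepts a sequence of percentiles and uses the same linear interpolation as scoreatpercentile, so the following sketch should match quants:
quants_vec = np.asarray(np.percentile(res, p*100))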
In [56]: mu = res.mean()
In [57]: sigma = res.std()
In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)
In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.PathCollection at 0xd04322c>
In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0xcfba48c>]
In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0xcf1fd8c>
In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0xcf88b8c>
In [63]: plt.xlabel('Quantiles of N(0,1)')
Out[63]: <matplotlib.text.Text at 0xcf838ec>
In [64]: from statsmodels import graphics
In [65]: img = graphics.gofplots.qqplot(res, line='r')
Example for using GLM with a Gamma family on proportional count response data. A brief description of the data and design:
In [66]: print sm.datasets.scotland.DESCRLONG
This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts. This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.
The original source files and variable information are included in
/scotland/src/
In [67]: data2 = sm.datasets.scotland.load()
In [68]: data2.exog = sm.add_constant(data2.exog)
In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())
In [70]: glm_results = glm_gamma.fit()
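The Gamma fit can be inspected the same way as the binomial fit above, e.g.:
print glm_results.params
print glm_results.tvalues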
Example for a Gaussian distribution with a noncanonical (log) link.
In [71]: nobs2 = 100
In [72]: x = np.arange(nobs2)
In [73]: np.random.seed(54321)
In [74]: X = np.column_stack((x,x**2))
In [75]: X = sm.add_constant(X)
In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)
In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))
In [78]: gauss_log_results = gauss_log.fit()
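The data were generated above with mean exp(-(.03*x + .0001*x**2 - 1.0)) plus a small uniform noise term, so with the log link the fitted coefficients should be close to the true values, in the column order (x, x**2, const):
print gauss_log_results.params  # expect roughly [-0.03, -0.0001, 1.0]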
Check the summary of the binomial model fit:
In [79]: print binom_results.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['y1', 'y2'] No. Observations: 303
Model: GLM Df Residuals: 282
Model Family: Binomial Df Model: 20
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -2998.6
Date: Fri, 21 Mar 2014 Deviance: 4078.8
Time: 06:54:06 Pearson chi2: 4.05e+03
No. Iterations: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0168 0.000 -38.749 0.000 -0.018 -0.016
x2 0.0099 0.001 16.505 0.000 0.009 0.011
x3 -0.0187 0.001 -25.182 0.000 -0.020 -0.017
x4 -0.0142 0.000 -32.818 0.000 -0.015 -0.013
x5 0.2545 0.030 8.498 0.000 0.196 0.313
x6 0.2407 0.057 4.212 0.000 0.129 0.353
x7 0.0804 0.014 5.775 0.000 0.053 0.108
x8 -1.9522 0.317 -6.162 0.000 -2.573 -1.331
x9 -0.3341 0.061 -5.453 0.000 -0.454 -0.214
x10 -0.1690 0.033 -5.169 0.000 -0.233 -0.105
x11 0.0049 0.001 3.921 0.000 0.002 0.007
x12 -0.0036 0.000 -15.878 0.000 -0.004 -0.003
x13 -0.0141 0.002 -7.391 0.000 -0.018 -0.010
x14 -0.0040 0.000 -8.450 0.000 -0.005 -0.003
x15 -0.0039 0.001 -4.059 0.000 -0.006 -0.002
x16 0.0917 0.015 6.321 0.000 0.063 0.120
x17 0.0490 0.007 6.574 0.000 0.034 0.064
x18 0.0080 0.001 5.362 0.000 0.005 0.011
x19 0.0002 2.99e-05 7.428 0.000 0.000 0.000
x20 -0.0022 0.000 -6.445 0.000 -0.003 -0.002
const 2.9589 1.547 1.913 0.057 -0.073 5.990
==============================================================================