Zero-inflated models

*Blog post
2021
Generalized linear model
Incomplete page
Rmarkdown graphs
Author

Steve Simon

Published

September 20, 2021

In models of counts, zeros represent a special case. Often there is more than one reason why you might observe a zero count, and you might want to model this probability separately. One approach is known as a zero inflated Poisson model.

library(broom)
library(dplyr)
library(ggplot2)
library(foreign)
library(magrittr)
library(pscl)

The zero inflated Poisson distribution

In any Poisson distribution, the probabilty of obtaining a zero is defined as

\(P[Y=0]=\frac{\lambda^0e^{-\lambda}}{0!}=e^{-\lambda}\)

Sometimes this probability is too low because there are multiple reasons why you might observe a zero. The zero-inflated Poisson model is a mixture of the Poisson random variable with a random variable that places all of its probability at zero.

\(\pi \ 0 + (1-\pi)\frac{e^{-\lambda}\lambda^y}{y!}\)

This changes the probability of zero to

\(P[Y=0]=\pi+(1-\pi)e^{-\lambda}\)

Since \(\lambda\) has to be a positive value, the probability above is greater than the Poisson probability for any value of \(\pi>0\).

The remaining probabilities are shrunk by a factor of \(1-\pi\) to compensate for the increase in the probability at zero.

You can develop a separate regression model for \(\pi\) the increase in the probability of zero in addition to the regression model for \(\lambda\), the mean of the Poisson distribution.

The zero-inflated Poisson probability distribution is not part of the exponential family, so you need separate methods for estimation. You can find a function in the pscl library that will fit a zero-inflated Poisson model.

Start with Poisson regression

It is easiest to understand the quasiPoisson model by comparing it to the Poisson model.

Francis Huang has a nice web page showing two datasets and how to analyze them using the Poisson regression model as well as several more sophisticated models. The first data set is available for anyone to download. The link I originally used no longer works.

fn <- "http://faculty.missouri.edu/huangf/data/poisson/articles.csv"
articles <- read.csv(fn)
head(articles)

Here is a different link for the same file, used at German Rodriguez’s website. I’ve placed a copy on my website in case this link breaks. Also, this dataset is available with the pscl library of R under the name, bioChemists.

fn <- "http://www.pmean.com/00files/couart2.dta"
articles <- read.dta(fn)
head(articles)
  art   fem     mar kid5  phd ment
1   0   Men Married    0 2.52    7
2   0 Women  Single    0 2.05    6
3   0 Women  Single    0 3.75    6
4   0   Men Married    1 1.18    3
5   0 Women  Single    0 3.75   26
6   0 Women Married    2 3.59    2

The variables are + gender (fem), + marital status (mar), + number of children (kid5), + prestige of graduate program (phd), + the number of articles published by the individual’s mentor (ment) + the number of articles published by the scientist (art: the outcome).

To fit a Poisson regression model in R, use the glm function with the argument, family=“poisson”.

pois1 <- glm(art ~ fem + ment + phd + mar + kid5, family=poisson, data = articles)
summary(pois1)

Call:
glm(formula = art ~ fem + ment + phd + mar + kid5, family = poisson, 
    data = articles)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.304617   0.102981   2.958   0.0031 ** 
femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
ment         0.025543   0.002006  12.733  < 2e-16 ***
phd          0.012823   0.026397   0.486   0.6271    
marMarried   0.155243   0.061374   2.529   0.0114 *  
kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1817.4  on 914  degrees of freedom
Residual deviance: 1634.4  on 909  degrees of freedom
AIC: 3314.1

Number of Fisher Scoring iterations: 5

The interpretation of the coefficients should be done after exponentiating their values.

pois1 %>%
  tidy %>%
  mutate(lower_limit=estimate-1.96*std.error) %>%
  mutate(upper_limit=estimate+1.96*std.error) %>%
  select(term, estimate, lower_limit, upper_limit) %>%
  mutate(estimate=exp(estimate)) %>%
  mutate(lower_limit=exp(lower_limit)) %>%
  mutate(upper_limit=exp(upper_limit)) -> pois1_coefficients
pois1_coefficients
# A tibble: 6 × 4
  term        estimate lower_limit upper_limit
  <chr>          <dbl>       <dbl>       <dbl>
1 (Intercept)    1.36        1.11        1.66 
2 femWomen       0.799       0.718       0.889
3 ment           1.03        1.02        1.03 
4 phd            1.01        0.962       1.07 
5 marMarried     1.17        1.04        1.32 
6 kid5           0.831       0.768       0.899

Fitting a zero-inflated-Poisson model

zip1 <- zeroinfl(art ~ fem + ment + phd + mar + kid5 | 1, family=poisson, data = articles)
Warning in optim(fn = countloglikfun, gr = countgradfun, par = c(lmstart, :
unknown names in control: family
Warning in optim(fn = loglikfun, gr = gradfun, par = c(start$count, start$zero,
: unknown names in control: family
summary(zip1)

Call:
zeroinfl(formula = art ~ fem + ment + phd + mar + kid5 | 1, data = articles, 
    family = poisson)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5296 -0.9976 -0.3026  0.5398  7.2884 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.553995   0.113836   4.867 1.14e-06 ***
femWomen    -0.231609   0.058670  -3.948 7.89e-05 ***
ment         0.021543   0.002160   9.973  < 2e-16 ***
phd          0.002526   0.028511   0.089    0.929    
marMarried   0.131971   0.066130   1.996    0.046 *  
kid5        -0.170474   0.043296  -3.937 8.24e-05 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.6814     0.1558  -10.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 14 
Log-likelihood: -1621 on 7 Df

We cannot use the broom package to extract coefficients.

References

An earlier version of this page was published on new.pmean.com.