Zero-inflated model

In statistics, a zero-inflated model is a statistical model based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations.

Zero-inflated Poisson

The first zero-inflated model is the zero-inflated Poisson model, which describes a random event whose counts per unit time contain an excess of zeros.^[1] For example, the number of insurance claims within a population for a certain type of risk would be zero-inflated by those people who have not taken out insurance against the risk and thus are unable to claim. The zero-inflated Poisson (ZIP) model employs two components that correspond to two zero-generating processes. The first process is governed by a binary distribution that generates structural zeros. The second process is governed by a Poisson distribution that generates counts, some of which may be zero. The two model components are described as follows:

\Pr (y_i = 0) = \pi + (1 - \pi) e^{-\lambda}

\Pr (y_i = h) = (1 - \pi) \frac{\lambda^{h} e^{-\lambda}} {h!},\qquad h \ge 1

where the outcome variable $y_i$ for the $i$th individual takes any non-negative integer value, $\lambda$ is the expected count of the Poisson component, and $\pi$ is the probability of extra zeros.
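The two generating processes can be illustrated with a small simulation. The following is a minimal sketch in standard-library Python, not a reference implementation: the function names `sample_poisson` and `sample_zip` are hypothetical, and the Poisson draw uses Knuth's multiplication method, which is only practical for moderate values of $\lambda$.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(n, lam, pi, rng=None):
    """Draw n zero-inflated Poisson counts (hypothetical helper).

    With probability pi the observation is a structural zero;
    otherwise it is an ordinary Poisson(lam) count, which may also be zero.
    """
    rng = rng or random.Random()
    return [0 if rng.random() < pi else sample_poisson(lam, rng)
            for _ in range(n)]


if __name__ == "__main__":
    ys = sample_zip(10_000, lam=2.0, pi=0.3, rng=random.Random(0))
    print("share of zeros:", ys.count(0) / len(ys))
    print("sample mean:   ", sum(ys) / len(ys))
```

With a large sample, the share of zeros should be close to $\pi + (1-\pi)e^{-\lambda}$, which is larger than the Poisson value $e^{-\lambda}$ whenever $\pi > 0$.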

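The mean is $(1-\pi) \lambda$ and the variance is $\lambda (1-\pi) (1+\lambda \pi)$. The ratio of variance to mean is therefore $1+\lambda \pi$, so the distribution is overdispersed relative to the Poisson distribution whenever $\pi > 0$.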
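The mean and variance formulas can be checked numerically against the probability mass function. The sketch below uses a hypothetical helper `zip_pmf` and illustrative parameter values ($\lambda = 2$, $\pi = 0.3$) that do not come from the cited sources; the sums are truncated at 100 terms, which is more than enough for a small $\lambda$.

```python
import math


def zip_pmf(h, lam, pi):
    """Probability that a zero-inflated Poisson variable equals h."""
    poisson = math.exp(-lam) * lam ** h / math.factorial(h)
    if h == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


lam, pi = 2.0, 0.3  # illustrative values only
probs = [zip_pmf(h, lam, pi) for h in range(100)]

total = sum(probs)
mean = sum(h * p for h, p in enumerate(probs))
var = sum(h * h * p for h, p in enumerate(probs)) - mean ** 2

print(total)                                   # should be close to 1
print(mean, (1 - pi) * lam)                    # numeric vs closed form
print(var, lam * (1 - pi) * (1 + lam * pi))    # numeric vs closed form
```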

Estimators of ZIP

The method of moments estimators are given by

$\hat{\lambda}_{mo} = \frac{s^2+m^2-m}{m},$

$\hat{\pi}_{mo} = \frac{s^2 - m}{s^2 + m^2 - m},$

where $m$ is the sample mean and $s^2$ is the sample variance.
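The moment estimators follow from equating $m$ and $s^2$ to the mean and variance of the ZIP distribution and solving for $\lambda$ and $\pi$. A minimal sketch is given below; the function names are hypothetical, and the use of the unbiased sample variance (divisor $n-1$, as computed by `statistics.variance`) is an assumption, since the text does not specify the divisor.

```python
from statistics import mean, variance


def zip_moment_estimates(m, s2):
    """Method-of-moments estimates (lambda, pi) from sample mean m and variance s2.

    Requires m > 0. If s2 < m (no overdispersion) the estimate of pi is
    negative, which signals that a ZIP model is not supported by the moments.
    """
    denom = s2 + m * m - m
    lam_hat = denom / m
    pi_hat = (s2 - m) / denom
    return lam_hat, pi_hat


def zip_moment_estimates_from_data(data):
    """Convenience wrapper; assumes the n-1 divisor for the sample variance."""
    return zip_moment_estimates(mean(data), variance(data))


if __name__ == "__main__":
    counts = [0] * 60 + [1] * 15 + [2] * 12 + [3] * 8 + [4] * 5  # made-up data
    print(zip_moment_estimates_from_data(counts))
```

Plugging in the theoretical moments $m = 1.4$ and $s^2 = 2.24$ of a ZIP distribution with $\lambda = 2$ and $\pi = 0.3$ recovers those parameter values.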

The maximum likelihood estimator^[2] can be found by solving the following equation

$\bar{x}(1- e^{-\hat{\lambda}_{ml}}) = \hat{\lambda}_{ml} \left( 1 - \frac{n_0}{n} \right).$

Here $\bar{x}$ is the sample mean and $\frac{n_0}{n}$ is the observed proportion of zeros.

This can be solved by iteration,^[3] and the maximum likelihood estimator for $\pi$ is given by

$\hat{\pi}_{ml} = 1 - \frac{\bar{x}}{\hat{\lambda}_{ml}}.$
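One simple way to carry out the iteration is to rewrite the likelihood equation as the fixed-point problem $\lambda = \bar{x}\,(1 - e^{-\lambda}) / (1 - n_0/n)$ and iterate from the mean of the positive observations. The sketch below is an illustration under that assumption, not the specific algorithm of the cited reference; the function name `zip_mle` and the tolerance settings are hypothetical.

```python
import math


def zip_mle(data, tol=1e-12, max_iter=10_000):
    """Maximum likelihood estimates (lambda, pi) for i.i.d. ZIP counts.

    Solves xbar * (1 - exp(-lam)) = lam * (1 - n0 / n) by fixed-point
    iteration, then sets pi = 1 - xbar / lam. No clipping is applied,
    so pi can come out negative for zero-deflated data.
    """
    n = len(data)
    n0 = sum(1 for x in data if x == 0)
    if n0 == n:
        raise ValueError("all observations are zero; lambda is not identifiable")
    xbar = sum(data) / n
    pos_mean = xbar / (1 - n0 / n)  # mean of the positive observations
    if pos_mean <= 1:
        raise ValueError("the equation has no positive root for these data")

    lam = pos_mean  # starts above the root; the iterates decrease towards it
    for _ in range(max_iter):
        new_lam = pos_mean * (1 - math.exp(-lam))
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, 1 - xbar / lam


if __name__ == "__main__":
    counts = [0] * 60 + [1] * 15 + [2] * 12 + [3] * 8 + [4] * 5  # made-up data
    print(zip_mle(counts))
```

The equation also has the trivial root $\lambda = 0$, which is why the iteration is started from a positive value and the degenerate cases are rejected up front.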

Related models

In 1994, Greene considered the zero-inflated negative binomial (ZINB) model.^[4] Daniel B. Hall adapted Lambert's methodology to an upper-bounded count situation, thereby obtaining a zero-inflated binomial (ZIB) model.^[5]

Discrete pseudo compound Poisson model

If the count data $Y$ have the property that the probability of zero is larger than the probability of a nonzero value, namely

\Pr (Y = 0) > 0.5

then the discrete data $Y$ obey a discrete pseudo compound Poisson distribution.^[6]

In fact, let ${G(z)}= \sum\limits_{n = 0}^\infty P(Y = n)z^n$ be the probability generating function of $Y$, and write $p_n = \Pr (Y = n)$. If $p_0 > 0.5$, then for $|z| \le 1$ we have $\left| {G(z)} \right| \geqslant {p_0} - \sum\limits_{i = 1}^\infty {{p_i}}|z|^i \geqslant {p_0} - \sum\limits_{i = 1}^\infty {{p_i}} = 2{p_0}-1 > 0$, so ${G(z)}$ has no zeros in the closed unit disc. It then follows from the Wiener–Lévy theorem^[7] that ${G(z)}$ has the form of the probability generating function of a discrete pseudo compound Poisson distribution.

A discrete random variable $Y$ whose probability generating function satisfies

G_Y(z) = \sum\limits_{n = 0}^\infty P(Y = n)z^n = \exp\left(\sum\limits_{k = 1}^\infty \alpha_k \lambda (z^k - 1)\right), \quad (|z| \le 1)

has a discrete pseudo compound Poisson distribution with parameters $(\lambda_1 ,\lambda_2, \ldots )=(\alpha_1 \lambda,\alpha_2 \lambda, \ldots ) \in \mathbb{R}^\infty$, where $\sum\limits_{k = 1}^\infty \alpha_k = 1$, $\sum\limits_{k = 1}^\infty \left| \alpha_k \right| < \infty$, $\alpha_k \in \mathbb{R}$ and $\lambda > 0$.

When all the $\alpha_k$ are non-negative, it is the discrete compound Poisson distribution (non-Poisson case), which has the overdispersion property.
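The parameters $\lambda_k = \alpha_k \lambda$ are the power-series coefficients of $\log G_Y(z)$, so they can be computed from the probabilities $p_n = \Pr(Y = n)$. The sketch below is an illustration, not a method taken from the cited references: it uses the recursion $k\,p_k = \sum_{j=1}^{k} j\,\lambda_j\,p_{k-j}$, which follows from differentiating $G_Y = \exp(\log G_Y)$. The function name and the ZIP parameter values ($\pi = 0.6$, $\lambda = 1$, giving $p_0 \approx 0.75$) are hypothetical choices.

```python
import math


def pseudo_compound_poisson_params(pmf, k_max):
    """Return [lambda_1, ..., lambda_{k_max}] from pmf[0..k_max].

    lambda_k is the coefficient of z**k in log G(z), obtained from
    k * p_k = sum_{j=1..k} j * lambda_j * p_{k-j}. Requires pmf[0] > 0.
    """
    p0 = pmf[0]
    coef = [0.0] * (k_max + 1)  # coef[0] is unused by the recursion
    for k in range(1, k_max + 1):
        conv = sum(j * coef[j] * pmf[k - j] for j in range(1, k))
        coef[k] = (pmf[k] - conv / k) / p0
    return coef[1:]


def zip_pmf_list(lam, pi, k_max):
    """pmf of a zero-inflated Poisson distribution on 0..k_max."""
    base = [math.exp(-lam) * lam ** n / math.factorial(n) for n in range(k_max + 1)]
    return [pi + (1 - pi) * base[0]] + [(1 - pi) * b for b in base[1:]]


zip_probs = zip_pmf_list(lam=1.0, pi=0.6, k_max=40)   # illustrative values
zip_params = pseudo_compound_poisson_params(zip_probs, 40)

print("sum of lambda_k:", sum(zip_params), " -log p0:", -math.log(zip_probs[0]))
print("first negative index k:",
      next(k for k, v in enumerate(zip_params, start=1) if v < 0))
```

Since $G_Y(1) = 1$, the $\lambda_k$ sum to $-\log p_0$, which gives a quick consistency check. For this ZIP example some of the $\lambda_k$ are negative, so the distribution is pseudo compound Poisson but not compound Poisson; for an ordinary Poisson distribution the recursion returns $\lambda_1 = \lambda$ and zero for all higher terms.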

References

1. Lambert, Diane (1992). "Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing". Technometrics 34 (1): 1–14. doi:10.2307/1269547. JSTOR 1269547.
2. Johnson, Norman L.; Kotz, Samuel; Kemp, Adrienne W. (1992). Univariate Discrete Distributions (2nd ed.). Wiley. pp. 312–314. ISBN 0-471-54897-9.
3. Böhning, Dankmar; Dietz, Ekkehart; Schlattmann, Peter; Mendonca, Lisette; Kirchner, Ursula (1999). "The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology". Journal of the Royal Statistical Society: Series A (Statistics in Society) (Wiley Online Library) 162 (2): 195–209. doi:10.1111/1467-985x.00130.
4. Greene, William H. (1994). "Some Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models". Working Paper EC-94-10: Department of Economics, New York University.
5. Hall, Daniel B. (2000). "Zero-Inflated Poisson and Binomial Regression with Random Effects: A Case Study". Biometrics 56 (4): 1030–1039. doi:10.1111/j.0006-341X.2000.01030.x.
6. Zhang, Huiming; Liu, Yunxiao; Li, Bo (2014). "Notes on discrete compound Poisson model with applications to risk theory". Insurance: Mathematics and Economics 59: 325–336. doi:10.1016/j.insmatheco.2014.09.012.
7. Zygmund, A. (2002). Trigonometric series. Cambridge: Cambridge University Press. p. 245.
