Propensity score matching

In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. The technique was first published by Paul Rosenbaum and Donald Rubin in 1983,^[1] and implements the Rubin causal model for observational studies.

The possibility of bias arises because the apparent difference in outcome between these two groups of units may depend on characteristics that affected whether or not a unit received a given treatment instead of due to the effect of the treatment per se. In randomized experiments, the randomization enables unbiased estimation of treatment effects; for each covariate, randomization implies that treatment-groups will be balanced on average, by the law of large numbers. Unfortunately, for observational studies, the assignment of treatments to research subjects is typically not random. Matching attempts to mimic randomization by creating a sample of units that received the treatment that is comparable on all observed covariates to a sample of units that did not receive the treatment.

For example, one may be interested to know the consequences of smoking or the consequences of going to university. The people 'treated' are simply those—the smokers, or the university graduates—who in the course of everyday life undergo whatever it is that is being studied by the researcher. In both of these cases it is unfeasible (and perhaps unethical) to randomly assign people to smoking or a university education, so observational studies are required. The treatment effect estimated by simply comparing a particular outcome—rate of cancer or life time earnings—between those who smoked and did not smoke or attended university and did not attend university would be biased by any factors that predict smoking or university attendance, respectively. PSM attempts to control for these differences to make the groups receiving treatment and not-treatment more comparable.

Overview

PSM is for cases of causal inference and simple selection bias in non-experimental settings in which: (i) few units in the non-treatment comparison group are comparable to the treatment units; and (ii) selecting a subset of comparison units similar to the treatment unit is difficult because units must be compared across a high-dimensional set of pretreatment characteristics.

In normal Matching we match on single characteristics that distinguish treatment and control groups (to try to make them more alike). But If the two groups do not have substantial overlap, then substantial error may be introduced: E.g., if only the worst cases from the untreated “comparison” group are compared to only the best cases from the treatment group, the result may be regression toward the mean which may make the comparison group look better or worse than reality.

PSM employs a predicted probability of group membership e.g., treatment vs. control group—based on observed predictors, usually obtained from logistic regression to create a counterfactual group. Also propensity scores may be used for matching or as covariates—alone or with other matching variables or covariates.

General procedure

1.Run logistic regression:

Dependent variable: Y = 1, if participate; Y = 0, otherwise.
Choose appropriate confounders (variables hypothesized to be associated with both treatment and outcome)
Obtain propensity score: predicted probability (p) or log[p/(1 − p)].

2. Check that propensity score is balanced across treatment and comparison groups, and check that covariates are balanced across treatment and comparison groups within strata of the propensity score.

Use standardized differences or graphs to examine distributions

3.Match each participant to one or more nonparticipants on propensity score:

Nearest neighbor matching
Caliper matching
Mahalanobis metric matching in conjunction with PSM
Stratification matching
Difference-in-differences matching (kernel and local linear weights)
Exact matching

4. Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample

5. Multivariate analysis based on new sample

Use analyses appropriate for non-independent matched samples if more than one nonparticipant is matched to each participant

Note: When you have multiple matches for a single treated observation, it is essential to use Weighted Least Squares rather than OLS.

Formal definition

A propensity score is the probability of a unit (e.g., person, classroom, school) being assigned to a particular treatment given a set of observed covariates. Propensity scores are used to reduce selection bias by equating groups based on these covariates.

Suppose that we have a binary treatment T, an outcome Y, and background variables X. The propensity score is defined as the conditional probability of treatment given background variables:

p(x) \ \stackrel{\mathrm{def}}{=}\ \Pr(T=1 | X=x).

Let Y(0) and Y(1) denote the potential outcomes under control and treatment, respectively. Then treatment assignment is (conditionally) unconfounded if potential outcomes are independent of treatment conditional on background variables X. This can be written compactly as

Y(0), Y(1) \perp T \,|\, X

where $\perp$ denotes statistical independence.^[1]

If unconfoundedness holds, then

Y(0), Y(1) \perp T \,|\, p(X).

Judea Pearl has shown that there exists a simple graphical test, called the back-door criterion, which detects the presence of confounding variables. To estimate the effect of treatment, the background variables X must block all back-door paths in the graph. This blocking can be done either by adding the confounding variable as a control in regression, or by matching on the confounding variable.^[2]

Advantages and disadvantages

Like other matching procedures, PSM estimates an average treatment effect from observational data. The key advantages of PSM were, at the time of its introduction, that by using a linear combination of covariates for a single score, it balances treatment and control groups on a large number of covariates without losing a large number of observations. If units in the treatment and control were balanced on a large number of covariates one at a time, large numbers of observations would be needed to overcome the "dimensionality problem" whereby the introduction of a new balancing covariate increases the minimum necessary number of observations in the sample geometrically.

One disadvantage of PSM is that it only accounts for observed (and observable) covariates. Factors that affect assignment to treatment and outcome but that cannot be observed cannot be accounted for in the matching procedure.^[3] As the procedure only controls for observed variables, any hidden bias due to latent variables may remain after matching.^[4] Another issue is that PSM requires large samples, with substantial overlap between treatment and control groups.

General concerns with matching have also been raised by Judea Pearl, who has argued that hidden bias may actually increase because matching on observed variables may unleash bias due to dormant unobserved confounders. Similarly, Pearl has argued that bias reduction can only be assured (asymptotically) by modelling the qualitative causal relationships between treatment, outcome, observed and unobserved covariates.^[5] Confounding occurs when the experimenter is unable to control for alternative, non-causal explanations for an observed relationship between independent and dependent variables. Such control should satisfy the "backdoor criterion" of Pearl.^[2]

Implementations in statistics packages

R: propensity score matching is available as part of the MatchIt package.^[6]^[7] It can also easibly be implemented manually.^[8]
Stata: several commands implement propensity score matching,^[9] including the user-written psmatch2.^[10] Stata version 13 and later also offers the build-in command teffects psmatch.^[11]

References

1 2 Rosenbaum, Paul R.; Rubin, Donald B. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects". Biometrika 70 (1): 41–55. doi:10.1093/biomet/70.1.41.
1 2 Pearl, J. (2000). Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. ISBN 0-521-77362-8.
↑ Garrido MM, et al. (2014). "Methods for Constructing and Assessing Propensity Scores". Health Services Research (Wiley) 49: 1701–20. doi:10.1111/1475-6773.12182. PMID 24779867.
↑ Shadish, W. R.; Cook, T. D.; Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin. ISBN 0-395-61556-9.
↑ Pearl, J. (2009). "Understanding propensity scores". Causality: Models, Reasoning, and Inference (Second ed.). New York: Cambridge University Press. ISBN 978-0-521-89560-6.
↑ Ho, Daniel; Imai, Kosuke; King, Gary; Stuart, Elizabeth (2007). "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference". Political Analysis 15 (3): 199–236. doi:10.1093/pan/mpl013.
↑ "MatchIt: Nonparametric Preprocessing for Parametric Casual Inference". R Project.
↑ Gelman, Andrew; Hill, Jennifer (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press. pp. 206–212. ISBN 978-0-521-68689-1.
↑ Implementing Propensity Score Matching Estimators with STATA. Lecture notes 2001
↑ Leuven, E.; Sianesi, B. "PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing".
↑ "teffects psmatch — Propensity-score matching" (PDF). Stata Manual.

Study design	Effect size Standard error Statistical power Sample size determination

Survey methodology	Sampling stratified cluster Opinion poll Questionnaire

Controlled experiments	Design control optimal Controlled trial Randomized Random assignment Replication Blocking Factorial experiment

Uncontrolled studies	Observational study Natural experiment Quasi-experiment

Statistical inference

Statistical theory

Frequentist inference

Confidence interval Testing hypotheses Power

Unbiased estimators	Mean unbiased minimum-variance Median unbiased

Biased estimators	Maximum likelihood Method of moments Minimum distance Density estimation

Parametric tests	Likelihood-ratio Wald Score

Specific tests

Z (normal) Student's t-test F Shapiro–Wilk Kolmogorov–Smirnov

Goodness of fit	Chi-squared G Sample source (Anderson–Darling) Sample normality (Shapiro–Wilk) Skewness / kurtosis normality (Jarque–Bera) Model comparison (Likelihood-ratio) Model quality (Akaike criterion)

Signed-rank	1-sample (Wilcoxon) 2-sample (Mann–Whitney U) 1-way anova (Kruskal–Wallis)

Bayesian inference

Correlation	Pearson product–moment Partial correlation Confounding variable Coefficient of determination

Regression analysis	Errors and residuals Regression model validation Mixed effects models Simultaneous equations models Multivariate adaptive regression splines (MARS)

Linear regression	Simple linear regression Ordinary least squares General linear model Bayesian regression

Non-standard predictors	Nonlinear regression Nonparametric Semiparametric Isotonic Robust Heteroscedasticity Homoscedasticity

Generalized linear model	Exponential families Logistic (Bernoulli) / Binomial / Poisson regressions

Partition of variance	Analysis of variance (ANOVA, anova) Analysis of covariance Multivariate ANOVA Degrees of freedom

Categorical / Multivariate / Time-series / Survival analysis

Categorical

Multivariate

Time-series

General	Decomposition Trend Stationarity Seasonal adjustment Exponential smoothing Cointegration Structural break Granger causality

Specific tests	Dickey–Fuller Johansen Q-statistic (Ljung–Box) Durbin–Watson Breusch–Godfrey

Time domain	Autocorrelation (ACF) partial (PACF) Cross-correlation (XCF) ARMA model ARIMA model (Box–Jenkins) Autoregressive conditional heteroskedasticity (ARCH) Vector autoregression (VAR)

Frequency domain	Spectral density estimation Fourier analysis Wavelet

Survival

Survival function	Kaplan–Meier estimator (product limit) Proportional hazards models Accelerated failure time (AFT) model First hitting time

Hazard function	Nelson–Aalen estimator

Test	Log-rank test

Applications

Biostatistics	Bioinformatics Clinical trials / studies Epidemiology Medical statistics

Engineering statistics	Chemometrics Methods engineering Probabilistic design Process / quality control Reliability System identification

Social statistics	Actuarial science Census Crime statistics Demography Econometrics National accounts Official statistics Population statistics Psychometrics

Spatial statistics	Cartography Environmental statistics Geographic information system Geostatistics Kriging

Category
Portal
Commons
WikiProject

Least squares and regression analysis

Computational statistics

Correlation and dependence

Regression analysis

Regression as a
statistical model

Linear regression	Simple linear regression Ordinary least squares Generalized least squares Weighted least squares General linear model

Predictor structure	Polynomial regression Growth curve (statistics) Segmented regression Local regression

Non-standard	Nonlinear regression Nonparametric Semiparametric Robust Quantile Isotonic

Non-normal errors	Generalized linear model Binomial Poisson Logistic

Decomposition of variance

Model exploration

Background

Design of experiments

Numerical approximation