Calibrated and Conformal Propensity Scores for Causal Effect Estimation
Shachi Deshpande, Volodymyr Kuleshov
TL;DR
This work addresses how miscalibrated propensity-score models can bias causal effect estimation in observational studies. It introduces a simple, post-hoc recalibration framework that learns a recalibrator $R$ to form $R\circ Q$, yielding calibrated treatment probabilities and improved uncertainty quantification. The paper establishes that calibration is a necessary condition for unbiased IPTW and, under reasonable conditions, for accurate AIPW estimates, and it provides error bounds that tighten as calibration improves. Empirically, calibrated propensities reduce estimation bias and variance across drug, image, and GWAS tasks, and they speed up GWAS analysis by more than two-fold by enabling simpler models that are faster to train while maintaining accuracy.
Abstract
Propensity scores are commonly used to estimate treatment effects from observational data. We argue that the probabilistic output of a learned propensity score model should be calibrated (i.e., a predicted treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group), and we propose simple recalibration techniques to ensure this property. We prove that calibration is a necessary condition for unbiased treatment effect estimation when using popular inverse propensity weighted and doubly robust estimators. We derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the probabilistic propensity score model and show that calibration strictly improves this error bound while also avoiding extreme propensity weights. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional image covariates and genome-wide association studies (GWASs). Calibrated propensity scores improve the speed of GWAS analysis by more than two-fold by enabling the use of simpler models that are faster to train.
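
The recalibration idea can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data generator, the overconfident model `q_raw` standing in for $Q$, and the choice of histogram binning as the recalibrator $R$ (any standard recalibration method could be substituted). The sketch fits $R$ on a held-out split, forms $R\circ Q$, and compares binned calibration error and the largest inverse-propensity weight before and after recalibration.

```python
import math
import random


def simulate(n, rng):
    """Synthetic observational data (x, t, y); the true ATE is 2.0 by construction."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = 1.0 / (1.0 + math.exp(-x))          # true propensity P(T=1 | x)
        t = 1 if rng.random() < e else 0
        y = 2.0 * t + x + rng.gauss(0.0, 1.0)   # x confounds treatment and outcome
        rows.append((x, t, y))
    return rows


def q_raw(x):
    """Hypothetical overconfident propensity model Q (slope 3 instead of 1)."""
    return 1.0 / (1.0 + math.exp(-3.0 * x))


def bin_index(p, n_bins):
    """Index of the equal-width bin on [0, 1] that contains probability p."""
    return min(int(p * n_bins), n_bins - 1)


def fit_recalibrator(scores, treatments, n_bins=10):
    """Learn R by histogram binning: each score bin maps to its treatment frequency."""
    counts = [0] * n_bins
    treated = [0] * n_bins
    for s, t in zip(scores, treatments):
        b = bin_index(s, n_bins)
        counts[b] += 1
        treated[b] += t
    # Laplace smoothing keeps every output strictly inside (0, 1).
    freqs = [(treated[b] + 1.0) / (counts[b] + 2.0) for b in range(n_bins)]
    return lambda s: freqs[bin_index(s, n_bins)]


def calibration_error(probs, treatments, n_bins=10):
    """Binned calibration error: bin-weighted |mean prediction - treated fraction|."""
    sum_p = [0.0] * n_bins
    sum_t = [0.0] * n_bins
    for p, t in zip(probs, treatments):
        b = bin_index(p, n_bins)
        sum_p[b] += p
        sum_t[b] += t
    return sum(abs(sum_p[b] - sum_t[b]) for b in range(n_bins)) / len(probs)


def iptw_ate(rows, propensity):
    """IPTW estimate of the ATE: mean of t*y/p - (1-t)*y/(1-p)."""
    total = 0.0
    for x, t, y in rows:
        p = propensity(x)
        total += t * y / p - (1 - t) * y / (1.0 - p)
    return total / len(rows)


def max_weight(rows, propensity):
    """Largest inverse-propensity weight assigned to any unit."""
    return max(
        1.0 / propensity(x) if t else 1.0 / (1.0 - propensity(x))
        for x, t, _ in rows
    )


rng = random.Random(0)
calib_rows = simulate(4000, rng)   # held-out split used only to fit R
eval_rows = simulate(4000, rng)    # split used to estimate the effect

R = fit_recalibrator([q_raw(x) for x, _, _ in calib_rows],
                     [t for _, t, _ in calib_rows])


def q_cal(x):
    """Recalibrated propensity: R composed with Q."""
    return R(q_raw(x))


eval_t = [t for _, t, _ in eval_rows]
for name, model in (("raw Q", q_raw), ("recalibrated R(Q)", q_cal)):
    err = calibration_error([model(x) for x, _, _ in eval_rows], eval_t)
    print(name, "| calibration error:", round(err, 3),
          "| max weight:", round(max_weight(eval_rows, model), 1),
          "| IPTW ATE:", round(iptw_ate(eval_rows, model), 2))
```

Histogram binning is used here only because it needs no external libraries. Because calibration is a necessary rather than sufficient condition, this toy run should be read as illustrating the mechanics of post-hoc recalibration and its effect on extreme weights, not as a reproduction of the paper's results.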
