Calibration Strategies for Robust Causal Estimation: Theoretical and Empirical Insights on Propensity Score-Based Estimators

Sven Klaassen; Jan Rabenseifner; Jannis Kueck; Philipp Bach

TL;DR

This work addresses the robustness of propensity score-based causal estimators under limited overlap and small samples by calibrating the propensity scores and integrating the calibration step with Double Machine Learning (DML). It formalizes the calibration identity $m_0(X)=\mathbb{E}[D\mid m_0(X)]$, constructs calibrated scores $\tilde{m}$ with methods such as isotonic regression, Platt scaling, and Venn-Abers, and analyzes how calibration interacts with data splitting in DML so that inference remains valid. Theoretical contributions include rates and asymptotic normality results for calibrated estimators, extensions to DML2 with various cross-fitting schemes, and explicit assumptions on calibration complexity. Extensive simulations show that calibration reduces the variance and mitigates the bias of IPW while preserving the doubly robust properties of DML, especially for flexible learners; stability gains are most pronounced with tree-based propensity estimators combined with appropriate sample splitting and clipping. The findings give practical guidance on when and how to apply propensity score calibration to improve the reliability and efficiency of causal estimates in finite samples.

Abstract

The partitioning of data for estimation and calibration critically impacts the performance of propensity score-based estimators such as inverse probability weighting (IPW) and double/debiased machine learning (DML) frameworks. We extend recent advances in calibration techniques for propensity score estimation, improving the robustness of propensity scores in challenging settings such as limited overlap, small sample sizes, or unbalanced data. Our contributions are twofold. First, we provide a theoretical analysis of the properties of calibrated estimators in the context of DML. To this end, we refine existing calibration frameworks for propensity score models, with a particular emphasis on the role of sample-splitting schemes in ensuring valid causal inference. Second, through extensive simulations, we show that calibration reduces the variance of inverse-based propensity score estimators while also mitigating bias in IPW, even in small-sample regimes. Notably, calibration improves stability for flexible learners (e.g., gradient boosting) while preserving the doubly robust properties of DML. A key insight is that, even when methods perform well without calibration, incorporating a calibration step does not degrade performance, provided that an appropriate sample-splitting approach is chosen.
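
To make the role of data partitioning concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical simulated data set, a three-way split (one fold to fit a gradient boosting propensity model, one to fit an isotonic calibrator, one to estimate), a plain Horvitz-Thompson IPW estimator, and a clipping threshold of 0.01; all of these choices, along with names such as `ipw_ate`, are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical simulated data; the treatment effect is 1.0 by construction.
n = 3000
X = rng.normal(size=(n, 3))
m0 = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))  # true propensity score
D = rng.binomial(1, m0)
Y = 1.0 * D + X[:, 0] + rng.normal(size=n)

# Three disjoint folds: fit the learner, fit the calibrator, estimate.
# (One possible splitting scheme, chosen here only for illustration.)
idx = rng.permutation(n)
fit_idx, cal_idx, est_idx = np.array_split(idx, 3)

# Step 1: flexible propensity learner on the first fold.
learner = GradientBoostingClassifier(random_state=0)
learner.fit(X[fit_idx], D[fit_idx])

# Step 2: isotonic calibration of the learner's scores on the second fold,
# i.e. a monotone regression of D on the uncalibrated score.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(learner.predict_proba(X[cal_idx])[:, 1], D[cal_idx])


def ipw_ate(y, d, m, clip=0.01):
    """Horvitz-Thompson IPW estimate of the ATE with clipped scores."""
    m = np.clip(m, clip, 1.0 - clip)
    return np.mean(d * y / m - (1 - d) * y / (1 - m))


# Step 3: estimate on the third fold with raw and calibrated scores.
m_raw = learner.predict_proba(X[est_idx])[:, 1]
m_cal = calibrator.predict(m_raw)

ate_raw = ipw_ate(Y[est_idx], D[est_idx], m_raw)
ate_cal = ipw_ate(Y[est_idx], D[est_idx], m_cal)
print(f"IPW ATE, uncalibrated scores: {ate_raw:.3f}")
print(f"IPW ATE, calibrated scores:   {ate_cal:.3f}")
```

Repeating the script over many seeds and comparing the spread of `ate_raw` and `ate_cal` is one way to probe the variance claim informally; a single run says little about either estimator.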
