Orthogonal Causal Calibration

Justin Whitehouse, Christopher Jung, Vasilis Syrgkanis, Bryan Wilder, Zhiwei Steven Wu

TL;DR

The paper tackles calibrating heterogeneous causal effect estimates by reframing calibration as a post-processing step for standard predictive models, even when nuisance parameters are involved. It introduces two orthogonality-based frameworks: universally orthogonal losses with a simple sample-split procedure and conditionally orthogonal losses with a generalized calibration approach, each supported by finite-sample error bounds that separate nuisance estimation from calibration error. By enabling the use of off-the-shelf calibration algorithms (e.g., isotonic regression, histogram binning, Platt scaling) on generalized pseudo-outcomes, the method applies broadly to targets such as $\mathrm{CATE}$, $\mathrm{CACD}$, $\mathrm{LATE}$, and $\mathrm{CQUT}$, including conditional quantiles under treatment. Empirical results on observational 401(k) data and synthetic CQUT tasks show substantial reductions in $L^2$ calibration error and robust “do no harm” behavior, illustrating practical improvements for policy and treatment decisions. Overall, the work provides a general, plug-in calibration framework that unifies causal calibration with classical predictive calibration, enabling reliable downstream decision-making across diverse causal parameters.
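For orientation, the $L^2$ calibration error mentioned above can be written out for the classical (non-causal) case and for the CATE. This is a sketch in assumed notation ($Y(1), Y(0)$ for potential outcomes, $\theta$ for the estimator being calibrated), not necessarily the paper's exact statement:

$$\mathrm{Cal}(\theta) = \left(\mathbb{E}\left[\left(\theta(X) - \mathbb{E}[Y \mid \theta(X)]\right)^2\right]\right)^{1/2}, \qquad \mathrm{Cal}_{\mathrm{CATE}}(\theta) = \left(\mathbb{E}\left[\left(\theta(X) - \mathbb{E}[Y(1) - Y(0) \mid \theta(X)]\right)^2\right]\right)^{1/2}.$$

In both cases the error is zero exactly when $\theta(X)$ equals the conditional mean of the target given the prediction itself. The causal version cannot be estimated directly because $Y(1) - Y(0)$ is never observed for any unit, which is why pseudo-outcomes built from nuisance estimates are needed.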

Abstract

Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimates are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called universal orthogonality, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample-splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.
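The sample-splitting recipe described in the abstract can be sketched for the CATE as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pseudo-outcome is the standard doubly robust (AIPW) one, the nuisance learner is a simple binned mean, the calibrator is histogram binning, and every function name and the synthetic data-generating process are hypothetical.

```python
import random

def true_cate(x):
    return 2.0 * x                      # synthetic ground truth (assumption)

def theta(x):
    return 3.0 * x - 0.5                # deliberately miscalibrated initial estimator

def simulate(n, rng):
    """Draw (x, a, y) with a randomized binary treatment (synthetic data)."""
    data = []
    for _ in range(n):
        x = rng.random()
        a = 1 if rng.random() < 0.5 else 0
        y = x + a * true_cate(x) + rng.gauss(0.0, 0.5)
        data.append((x, a, y))
    return data

def fit_binned_mean(xs, ys, n_bins=10):
    """Piecewise-constant regression on [0, 1]; a stand-in nuisance learner."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def dr_pseudo_outcomes(data, mu0, mu1, pi):
    """Doubly robust (AIPW) pseudo-outcomes whose conditional mean is the CATE."""
    out = []
    for x, a, y in data:
        m0, m1 = mu0(x), mu1(x)
        out.append(m1 - m0 + a * (y - m1) / pi - (1 - a) * (y - m0) / (1 - pi))
    return out

def histogram_binning(scores, targets, n_bins=10):
    """Generic calibrator: map a score to the mean target in its equal-width bin."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0
    def bin_of(s):
        return max(0, min(int((s - lo) / width), n_bins - 1))
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for s, t in zip(scores, targets):
        sums[bin_of(s)] += t
        counts[bin_of(s)] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda s: means[bin_of(s)]

def cross_calibrate(theta, nuisance_fold, calibration_fold, n_bins=10):
    # Step 1: estimate nuisances (outcome regressions, propensity) on fold 1.
    treated = [(x, y) for x, a, y in nuisance_fold if a == 1]
    control = [(x, y) for x, a, y in nuisance_fold if a == 0]
    mu1 = fit_binned_mean([x for x, _ in treated], [y for _, y in treated])
    mu0 = fit_binned_mean([x for x, _ in control], [y for _, y in control])
    pi = len(treated) / len(nuisance_fold)   # constant propensity (assumption)
    # Step 2: build pseudo-outcomes on the held-out fold 2.
    chi = dr_pseudo_outcomes(calibration_fold, mu0, mu1, pi)
    # Step 3: run an ordinary calibrator of theta(X) against the pseudo-outcomes.
    scores = [theta(x) for x, _, _ in calibration_fold]
    calibrator = histogram_binning(scores, chi, n_bins)
    return (lambda x: calibrator(theta(x))), chi

rng = random.Random(0)
fold1, fold2 = simulate(2000, rng), simulate(2000, rng)
theta_cal, chi = cross_calibrate(theta, fold1, fold2)

grid = [i / 200 for i in range(201)]
mse_before = sum((theta(x) - true_cate(x)) ** 2 for x in grid) / len(grid)
mse_after = sum((theta_cal(x) - true_cate(x)) ** 2 for x in grid) / len(grid)
print(f"squared error vs. true CATE: before={mse_before:.4f}, after={mse_after:.4f}")
```

Because the calibration step only ever sees pairs of scores and pseudo-outcomes, `histogram_binning` could be swapped for isotonic regression or another standard calibrator without touching the nuisance-estimation step, which is the modularity the abstract emphasizes.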

Paper Structure (30 sections, 23 theorems, 111 equations, 2 figures, 2 tables, 6 algorithms)

Key Result

Theorem 3.3

Let $\ell : \mathbb{R} \times \mathcal{G} \times \mathcal{Z} \rightarrow \mathbb{R}$ be universally orthogonal, per Definition 3.1. Let $g_0 \in \mathcal{G}$ denote the true nuisance parameters associated with $\ell$. Suppose $D^2_g\mathbb{E}[\partial \ell(\theta, f; Z) \mid X](g - g_0, g \ldots)$ … where $\mathrm{err}(g, h; \theta) := \sup_{f \in [g, h]}\sqrt{\mathbb{E}\left(\left\{D_g^2\mathbb{E}\ldots\right\}\ldots\right)}$ …

Figures (2)

  • Figure 1:
  • Figure 2: We plot the performance of Algorithm \ref{alg:cross-cal-cond} in calibrating estimates of conditional quantiles under treatment using linear calibration. We display both the sample $L^2$ calibration error and the average loss for $N \in \{500, 1000, 1500, 2000, 2500, 3000\}$ and $Q \in \{0.6, 0.75, 0.9\}$ (where an additional $N$ samples are used for calibration). We also display corresponding 95% pointwise-valid confidence intervals. Cross-calibration not only decreases calibration error (as expected), but also decreases loss.

Theorems & Definitions (44)

  • Definition 2.1
  • Definition 2.2: Classical calibration error
  • Example 2.3
  • Definition 3.1: Universal Orthogonality
  • Example 3.2
  • Theorem 3.3
  • Proposition 3.3
  • Theorem 3.4
  • Definition 4.1: Calibration Function
  • Definition 4.2: Neyman Orthogonality
  • ...and 34 more