Orthogonal Causal Calibration

Justin Whitehouse; Christopher Jung; Vasilis Syrgkanis; Bryan Wilder; Zhiwei Steven Wu

TL;DR

The paper tackles the calibration of heterogeneous causal effect estimates by reducing it to a post-processing step of the kind used for standard predictive models, even when nuisance parameters are involved. It introduces two orthogonality-based frameworks: universally orthogonal losses, handled by a simple sample-splitting procedure, and conditionally orthogonal losses, handled by a generalized calibration approach. Each is supported by finite-sample error bounds that separate nuisance estimation error from calibration error. Because off-the-shelf calibration algorithms (e.g., isotonic regression, histogram binning, Platt scaling) can be run on generalized pseudo-outcomes, the method applies broadly to targets such as $\mathrm{CATE}$, $\mathrm{CACD}$, $\mathrm{LATE}$, and $\mathrm{CQUT}$ (conditional quantiles under treatment). Empirical results on observational 401(k) data and synthetic CQUT tasks show substantial reductions in $L^2$ calibration error and robust “do no harm” behavior, illustrating practical improvements for policy and treatment decisions. Overall, the work provides a general, plug-in calibration framework that unifies causal calibration with classical predictive calibration, enabling reliable downstream decision-making across diverse causal parameters.
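As a hedged illustration of what $L^2$ calibration error means for one target, consider the CATE $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$. The notation below ($\tau$, $\chi$, $\mathrm{Cal}$) is assumed here for exposition and may differ from the paper's exact definitions.

$$
\mathrm{Cal}(\theta) := \sqrt{\mathbb{E}\Big[\big(\theta(X) - \mathbb{E}[\tau(X) \mid \theta(X)]\big)^2\Big]}
$$

If a pseudo-outcome $\chi$ satisfies $\mathbb{E}[\chi \mid X] = \tau(X)$, then the tower property gives

$$
\mathbb{E}[\chi \mid \theta(X)] = \mathbb{E}\big[\mathbb{E}[\chi \mid X] \mid \theta(X)\big] = \mathbb{E}[\tau(X) \mid \theta(X)],
$$

so calibrating $\theta$ against $\chi$ is an ordinary regression calibration problem. In practice $\chi$ is built from estimated nuisances, which is where the separate nuisance-error term in the bounds comes from.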

Abstract

Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimates are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called universal orthogonality, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample-splitting algorithm that performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two-term upper bound on the calibration error of any model. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.
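The pseudo-outcome reduction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm as published: the simulated data, the deliberately miscalibrated estimate `theta`, the constant per-arm outcome models, the known propensity of 0.5, and all function names are hypothetical. The calibrator is plain histogram binning. The final comparison uses squared error to the true CATE, which is available only because the data are simulated.

```python
import bisect
import random

rng = random.Random(0)

def simulate(n):
    """Randomized experiment (assumed): true CATE is 2*x, propensity is 0.5."""
    rows = []
    for _ in range(n):
        x = rng.random()
        a = 1 if rng.random() < 0.5 else 0
        y = 2.0 * x * a + rng.gauss(0.0, 1.0)
        rows.append((x, a, y))
    return rows

def theta(x):
    """Hypothetical, deliberately miscalibrated CATE estimate."""
    return x

def fit_nuisances(fold):
    """Crude nuisance estimates: constant outcome mean per arm, known propensity."""
    y0 = [y for _, a, y in fold if a == 0]
    y1 = [y for _, a, y in fold if a == 1]
    return sum(y0) / len(y0), sum(y1) / len(y1), 0.5

def pseudo_outcome(a, y, mu0, mu1, pi):
    """Doubly robust (AIPW) pseudo-outcome for the CATE."""
    mu_a = mu1 if a == 1 else mu0
    weight = a / pi - (1 - a) / (1 - pi)
    return mu1 - mu0 + weight * (y - mu_a)

def histogram_binning(scores, targets, n_bins=10):
    """Map each score to the mean target of its equal-mass bin."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    edges, means = [], []
    for b in range(n_bins):
        lo = b * size
        hi = len(order) if b == n_bins - 1 else (b + 1) * size
        members = order[lo:hi]
        means.append(sum(targets[i] for i in members) / len(members))
        if b > 0:
            edges.append(scores[members[0]])  # left edge of bin b
    def calibrated(score):
        return means[bisect.bisect_right(edges, score)]
    return calibrated

# Sample splitting: nuisances on one fold, calibration on another.
nuisance_fold = simulate(2000)
calibration_fold = simulate(4000)
test_fold = simulate(2000)

mu0, mu1, pi = fit_nuisances(nuisance_fold)
scores = [theta(x) for x, _, _ in calibration_fold]
chi = [pseudo_outcome(a, y, mu0, mu1, pi) for _, a, y in calibration_fold]
calibrate = histogram_binning(scores, chi)

def mse_to_truth(predict):
    """Mean squared error against the simulated true CATE 2*x."""
    return sum((predict(x) - 2.0 * x) ** 2 for x, _, _ in test_fold) / len(test_fold)

mse_raw = mse_to_truth(theta)
mse_cal = mse_to_truth(lambda x: calibrate(theta(x)))
print("raw:", round(mse_raw, 4), "calibrated:", round(mse_cal, 4))
```

Histogram binning could be swapped for isotonic regression or another calibrator without changing the rest of the sketch, since the calibrator only ever sees the pairs of scores and pseudo-outcomes.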

Paper Structure (30 sections, 23 theorems, 111 equations, 2 figures, 2 tables, 6 algorithms)

Introduction
Our Contributions
Related Work
Calibration:
Double/debiased Machine Learning:
Calibration of Causal Effects
Calibration for Universally Orthogonal Losses
Sample Splitting Algorithm
Calibration for Conditionally Orthogonal Losses
A General Bound on Calibration Error
A Sample Splitting Algorithm
Experiments
Effects of 401(k) Participation/Eligibility on Financial Assets
Model Training and Calibration:
Comparing Calibration Error in Quartiles
...and 15 more sections

Key Result

Theorem 3.3

Let $\ell : \mathbb{R} \times \mathcal{G} \times \mathcal{Z} \rightarrow \mathbb{R}$ be universally orthogonal, per Definition 3.1. Let $g_0 \in \mathcal{G}$ denote the true nuisance parameters associated with $\ell$. Suppose $D^2_g\mathbb{E}[\partial \ell(\theta, f; Z) \mid X](g - g_0, g \dots)$ … where $\mathrm{err}(g, h; \theta) := \sup_{f \in [g, h]}\sqrt{\mathbb{E}\big(\{D_g^2\mathbb{E}[\dots]\dots\}\dots\big)}$ …

Figures (2)

Figure 1:
Figure 2: We plot the performance of the cross-calibration algorithm for conditionally orthogonal losses (alg:cross-cal-cond) in calibrating estimates of conditional quantiles under treatment using linear calibration. We display both the sample $L^2$ calibration error and the average loss for $N \in \{500, 1000, 1500, 2000, 2500, 3000\}$ and $Q \in \{0.6, 0.75, 0.9\}$ (where an additional $N$ samples are used for calibration). We also display corresponding 95% pointwise-valid confidence intervals. Cross-calibration not only decreases calibration error (as expected), but also decreases loss.
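To make the quantile setting in this caption concrete, here is a minimal sketch of calibrating a conditional-quantile-under-treatment estimate by minimizing an inverse-propensity-weighted pinball loss. It makes several simplifying assumptions that are not taken from the paper: the calibrator is an intercept-only shift (a special case of linear calibration), the propensity is known rather than estimated on a separate fold, and the data-generating process, `theta`, and all function names are hypothetical.

```python
import random

rng = random.Random(1)
Q = 0.75  # target quantile level

def propensity(x):
    """Assumed known treatment probability (the nuisance in this sketch)."""
    return 0.3 + 0.4 * x

def simulate(n):
    """Under treatment, Y = x + N(0, 1), so the true Q-quantile is x + z_Q."""
    rows = []
    for _ in range(n):
        x = rng.random()
        a = 1 if rng.random() < propensity(x) else 0
        y = x + rng.gauss(0.0, 1.0) if a == 1 else rng.gauss(0.0, 1.0)
        rows.append((x, a, y))
    return rows

def theta(x):
    """Hypothetical initial estimate: tracks the median, too low for Q = 0.75."""
    return x

def weighted_pinball(shift, rows):
    """Average inverse-propensity-weighted pinball loss of theta(x) + shift."""
    total = 0.0
    for x, a, y in rows:
        if a == 1:
            r = y - (theta(x) + shift)
            indicator = 1.0 if r <= 0 else 0.0
            total += (1.0 / propensity(x)) * r * (Q - indicator)
    return total / len(rows)

def fit_shift(rows):
    """Exact minimizer over shifts: the weighted Q-quantile of treated residuals."""
    treated = sorted((y - theta(x), 1.0 / propensity(x))
                     for x, a, y in rows if a == 1)
    total_weight = sum(w for _, w in treated)
    cumulative = 0.0
    for residual, w in treated:
        cumulative += w
        if cumulative >= Q * total_weight:
            return residual
    return treated[-1][0]

rows = simulate(3000)
shift = fit_shift(rows)
print("fitted shift:", round(shift, 3))
print("loss before:", round(weighted_pinball(0.0, rows), 4))
print("loss after: ", round(weighted_pinball(shift, rows), 4))
```

In this simulation the residuals are standard normal, so the fitted shift should land near the 0.75 quantile of a standard normal (about 0.674). A full linear calibrator would also fit a slope on `theta(x)`, typically by numerical minimization of the same weighted loss.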

Theorems & Definitions (44)

Definition 2.1
Definition 2.2: Classical calibration error
Example 2.3
Definition 3.1: Universal Orthogonality
Example 3.2
Theorem 3.3
Proposition 3.3
Theorem 3.4
Definition 4.1: Calibration Function
Definition 4.2: Neyman Orthogonality
...and 34 more
