Learning Counterfactually Invariant Predictors

Francesco Quinzan, Cecilia Casolo, Krikamol Muandet, Yucen Luo, Niki Kilbertus

TL;DR

The paper tackles learning predictors that are invariant to counterfactual changes in the data-generating process by casting CI as a conditional independence constraint in the observational distribution, under a graphical injectivity assumption. It introduces Counterfactually Invariant Prediction (CIP), a model-agnostic framework that uses the Hilbert-Schmidt Conditional Independence Criterion (HSCIC) to enforce CI while allowing mixed data types; CI is optimized via a tunable parameter $\gamma$ that trades accuracy for invariance. The authors provide theoretical support linking CI to conditional independence, describe practical estimation of HSCIC from samples, and validate CIP on synthetic and real datasets (including dSprites and UCI Adult) showing favorable MSE/VCF trade-offs and improved counterfactual robustness. They also discuss computational considerations, limitations (notably the need for a known causal graph and scalable CI estimation), and potential extensions to causal representation learning and broader CI notions. Overall, CIP offers a principled, kernel-based route to robust, fair, and generalizable predictors under counterfactual shifts.

Abstract

Notions of counterfactual invariance (CI) have proven essential for predictors that are fair, robust, and generalizable in the real world. We propose graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of a conditional independence in the observational distribution. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactually Invariant Prediction (CIP), building on the Hilbert-Schmidt Conditional Independence Criterion (HSCIC), a kernel-based conditional dependence measure. Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various simulated and real-world datasets including scalar and multi-variate settings.
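
To make the idea of an HSCIC-regularized predictor concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a penalized objective of the form "MSE plus $\gamma$ times an HSCIC estimate", Gaussian kernels with a fixed bandwidth, and a kernel-ridge-style estimator of conditional mean embeddings evaluated at the observed values of the conditioning set. The function names (`gaussian_kernel`, `hscic`, `cip_objective`) and the parameters `ridge` and `sigma` are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(x, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel for samples x of shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def hscic(y_hat, a, s, ridge=0.01, sigma=1.0):
    """Sample estimate of HSCIC(y_hat, a | s), averaged over the observed s.

    Assumption: conditional mean embeddings are estimated with ridge weights
    w(s_j) = (K_S + n * ridge * I)^{-1} k_S(s_j), one weight vector per sample.
    """
    n = len(y_hat)
    k_y, k_a, k_s = (gaussian_kernel(v, sigma) for v in (y_hat, a, s))
    # Column j of w holds the ridge weights for conditioning on s_j.
    w = np.linalg.solve(k_s + n * ridge * np.eye(n), k_s)
    ky_w = k_y @ w
    ka_w = k_a @ w
    # Squared RKHS distance between the joint embedding and the product
    # of the marginal embeddings, computed separately for each s_j.
    joint = np.sum(w * ((k_y * k_a) @ w), axis=0)
    cross = np.sum(w * ky_w * ka_w, axis=0)
    product = np.sum(w * ky_w, axis=0) * np.sum(w * ka_w, axis=0)
    return float(np.mean(joint - 2.0 * cross + product))


def cip_objective(y_hat, y, a, s, gamma):
    """Assumed penalized objective: predictive loss plus gamma times HSCIC."""
    mse = float(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2))
    return mse + gamma * hscic(y_hat, a, s)


# Toy data (purely illustrative): s confounds a and y.
rng = np.random.default_rng(0)
s = rng.normal(size=100)
a = s + rng.normal(size=100)
y = a + s + rng.normal(size=100)
y_hat = 0.8 * a + 0.9 * s  # stand-in for the output of some predictor

print(cip_objective(y_hat, y, a, s, gamma=0.0))  # plain MSE
print(cip_objective(y_hat, y, a, s, gamma=1.0))  # MSE plus invariance penalty
```

With `gamma=0.0` the objective reduces to the plain MSE, and larger values of `gamma` weight the dependence penalty more heavily. An actual training loop would compute the same quantity in an automatic-differentiation framework so that the penalty can be minimized with respect to the predictor's parameters.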

Paper Structure (55 sections, 21 theorems, 44 equations, 8 figures, 5 tables)

Key Result

Theorem 3.1

Let $\mathcal{G}$ be a causal graph, $\mathbf{A}$, $\mathbf{W}$ be two (not necessarily disjoint) sets of nodes in $\mathcal{G}$, such that $(\mathbf{A} \cup \mathbf{W}) \cap \mathbf{Y} = \emptyset$, let $\mathbf{S}$ be a valid adjustment set for $(\mathbf{A} \cup \mathbf{W}, \mathbf{Y})$. Further,

Figures (8)

  • Figure 1: (a) Exemplary graph in which a predictor $\hat{\mathbf{Y}}$ with $\hat{\mathbf{Y}} \perp\!\!\!\perp \mathbf{A} \cup \mathbf{L} \mid \mathbf{S}$ is CI in $\mathbf{A}$ w.r.t. $\{\mathbf{L},\mathbf{S}\}$. (b)-(c) Causal and anti-causal structure from Veitch et al. (2021), where $\mathbf{X}^{\bot}_{\mathbf{A}}$ is not causally influenced by $\mathbf{A}$, $\mathbf{X}^{\bot}_{\mathbf{Y}}$ does not causally influence $\mathbf{Y}$, and $\mathbf{X}^\wedge$ is both influenced by $\mathbf{A}$ and influences $\mathbf{Y}$. (d) Assumed causal structure for the synthetic experiments; see the synthetic experiments section and the corresponding appendix for details. (e) Assumed causal graph for the UCI Adult dataset (fairness results section), where $\mathbf{A} = \{\text{Gender, Age}\}$. (f) Causal structure for our semi-synthetic image experiments (image data section), where $\mathbf{A} = \{\text{Pos.X}\}$, $\mathbf{U} = \{\text{Scale}\}$, $\mathbf{C} = \{\text{Shape}, \text{Pos.Y}\}$, $\mathbf{L} = \{\text{Color}, \text{Orientation}\}$, and $\mathbf{Y} = \{\text{Outcome}\}$.
  • Figure 2: Results on synthetic data (see the synthetic experiment and baselines appendices). Left: trade-off between MSE and counterfactual invariance (VCF). Middle: strong correspondence between HSCIC and VCF. Right: performance of CIP against baselines CF1 and CF2 and the naive baseline. As $\gamma$ increases, CIP traces out a frontier characterizing the trade-off between MSE and CI. CF2 and the naive baseline are Pareto-dominated by CIP, i.e., we can pick $\gamma$ to outperform CF2 in both MSE and VCF simultaneously. CF1 has zero VCF by design, but worse predictive performance than CIP at near-zero VCF. Error bars are standard errors over 10 seeds.
  • Figure 3: MSE and VCF for synthetic data (multi-dimensional setting appendix) with 10- and 50-dimensional $\mathbf{A}$ for different $\gamma$ and 15 random seeds per box. CIP reliably achieves CI as $\gamma$ increases.
  • Figure 4: On the dSprites image dataset, CIP trades off MSE for VCF and achieves almost full CI as $\gamma$ increases. Boxes are for 8 random seeds.
  • Figure 5: Accuracy and VCF on the Adult dataset. CIP achieves better VCF than CF2 and the naive baseline (NB), improves on PSCF in accuracy, and is on par with CF1 in accuracy at VCF $\approx 0$.
  • ...and 3 more figures

Theorems & Definitions (42)

  • Definition 2.1: Structural causal model (SCM)
  • Definition 2.2: Counterfactual invariance
  • Definition 3.0: valid adjustment set
  • Theorem 3.1
  • Definition 3.2: HSCIC
  • Theorem 3.3: Theorem 5.4 of Park and Muandet (2020)
  • Corollary 3.4
  • Corollary 3.5
  • Definition A.1: Def. 2.1 of Fawkes et al. (2022)
  • Definition A.2: Def. 2.3 of Fawkes and Evans (2023) and Def. 1.1 of Veitch et al. (2021)
  • ...and 32 more