Synthetic Potential Outcomes and Causal Mixture Identifiability

Bijan Mazaheri, Chandler Squires, Caroline Uhler

TL;DR

This work tackles heterogeneity across populations by grouping them according to their causal response and by introducing Synthetic Potential Outcomes (SPOs), which use higher-order moments of the observable data to synthetically sample from counterfactual distributions. The authors place SPOs within a four-level identifiability hierarchy, establishing conditions under which heterogeneous treatment effects (HTEs), mixtures of products, mixed treatment effects (MTEs), and average treatment effects (ATEs) are identifiable, including a novel level-3 identifiability result. The SPO framework recovers ATEs and MTEs by moment matching: it combines observable moments with higher-order moments of the treatment effect and applies tools such as PARAFAC tensor decomposition and matrix pencil/Prony methods. Empirical results on synthetic data illustrate that SPO-based recovery remains accurate in regimes where mixture-based methods degrade, and the approach supports continuous covariates, broadening its applicability to causal inference with latent heterogeneity. Overall, the paper contributes a principled, scalable method for identifying and quantifying causal heterogeneity via synthetic counterfactuals, with practical implications for policy evaluation, medical interventions, and mechanism-based class analysis.
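
To make the matrix pencil/Prony step mentioned above concrete, here is a minimal sketch of the generic technique, not the authors' implementation. It assumes exact moments $m_j = \sum_i w_i r_i^j$ of a discrete treatment effect are already available (in the paper these would come from SPOs), and the function name `matrix_pencil_from_moments` and the toy numbers are hypothetical.

```python
import numpy as np


def matrix_pencil_from_moments(moments, k):
    """Recover k support points r_i and weights w_i from the moments
    m_j = sum_i w_i * r_i**j for j = 0, ..., 2k - 1 (hypothetical helper)."""
    m = np.asarray(moments, dtype=float)
    if m.size < 2 * k:
        raise ValueError("need at least 2k moments")
    # Hankel matrices: H0[a, b] = m[a + b] and H1[a, b] = m[a + b + 1].
    idx = np.arange(k)[:, None] + np.arange(k)[None, :]
    H0 = m[idx]
    H1 = m[idx + 1]
    # The eigenvalues of H0^{-1} H1 are the support points r_i.
    values = np.sort(np.linalg.eigvals(np.linalg.solve(H0, H1)).real)
    # The weights solve the Vandermonde system sum_i w_i * r_i**j = m_j, j < k.
    V = np.vander(values, N=k, increasing=True).T
    weights = np.linalg.solve(V, m[:k])
    return values, weights


# Toy example (assumed numbers): two latent groups with treatment effects
# -1 and 2, occurring with probabilities 0.3 and 0.7.
true_effects = np.array([-1.0, 2.0])
true_weights = np.array([0.3, 0.7])
moments = [float(np.sum(true_weights * true_effects**j)) for j in range(4)]

effects, weights = matrix_pencil_from_moments(moments, k=2)
ate = float(weights @ effects)  # equals the first moment m_1
```

In this toy setting the recovered support points play the role of the MTEs and their weighted average plays the role of the ATE. With estimated rather than exact moments, the Hankel matrices can be ill-conditioned, so a practical version would need regularization or a more robust pencil formulation.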

Abstract

Heterogeneous data from multiple populations, sub-groups, or sources is often represented as a "mixture model" with a single latent class influencing all of the observed covariates. Heterogeneity can be resolved at multiple levels by grouping populations according to different notions of similarity. This paper proposes grouping with respect to the causal response of an intervention or perturbation on the system. This definition is distinct from previous notions, such as similar covariate values (e.g. clustering) or similar correlations between covariates (e.g. Gaussian mixture models). To solve the problem, we "synthetically sample" from a counterfactual distribution using higher-order multi-linear moments of the observable data. To understand how these "causal mixtures" fit in with more classical notions, we develop a hierarchy of mixture identifiability.

Paper Structure

This paper contains 55 sections, 8 theorems, 61 equations, 3 figures, and 1 algorithm.

Key Result

Theorem 1

The ATE, given by $\mathbb{E}(R)$, is identifiable by SPOs if

Figures (3)

  • Figure 1: HTEs are only identifiable if $U$ is observed. Identification of the next three levels requires decreasingly restrictive graphical assumptions, demonstrated by the addition of an edge. $Z {\color{blue} \leftrightarrow} T$ indicates an arrow that could go in either direction (or a bidirected arrow from unobserved confounding).
  • Figure 2: On the left, as we vary ${\textcolor{blue}{\mu_{zt}}}$, mixture estimation error increases but MTE estimation error is stable and close to zero. On the right, as we vary ${\textcolor{red}{\mu_{xy}}}$, MTE estimation error increases but mixture estimation error is stable and close to zero. The blue line is the average error; the shading covers one standard deviation. The dashed gray line is the mean over all parameter values.
  • Figure 3: Synthetic Potential Outcomes accurately recover the ATE (average treatment effect), as well as the decomposition of the ATE into MTEs (mixed treatment effects). In each plot, the true value is shown as a black vertical line, and the estimated values from 100 runs are shown as a histogram. See text for details.

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Theorem 3 (Rabani et al., 2014)
  • Theorem 4: SPO Sample Complexity
  • Lemma 6
  • proof
  • Lemma 7
  • proof
  • Lemma 8
  • proof
  • ...and 2 more