Synthetic Potential Outcomes and Causal Mixture Identifiability
Bijan Mazaheri, Chandler Squires, Caroline Uhler
TL;DR
This work tackles heterogeneity across populations by grouping them according to their causal response and by introducing Synthetic Potential Outcomes (SPOs), which leverage higher-order moments to synthetically sample counterfactuals. The authors place SPOs within a four-level identifiability hierarchy, establishing conditions under which heterogeneous treatment effects (HTEs), mixtures of products, mixed treatment effects (MTEs), and average treatment effects (ATEs) are identifiable, including a novel level-3 identifiability result. The SPO framework recovers ATEs and MTEs by moment matching, using observable moments and higher-order moments of the treatment effect together with tools such as PARAFAC tensor decomposition and matrix pencil/Prony methods. Empirical results on synthetic data illustrate that SPO-based recovery remains accurate in regimes where mixture-based methods degrade, and the approach supports continuous covariates, broadening its applicability to causal inference with latent heterogeneity. Overall, the paper contributes a principled, scalable method for identifying and quantifying causal heterogeneity via synthetic counterfactuals, with practical implications for policy evaluation, medical interventions, and mechanism-based class analysis.
Abstract
Heterogeneous data from multiple populations, sub-groups, or sources is often represented as a "mixture model" with a single latent class influencing all of the observed covariates. Heterogeneity can be resolved at multiple levels by grouping populations according to different notions of similarity. This paper proposes grouping with respect to the causal response of an intervention or perturbation on the system. This definition is distinct from previous notions, such as similar covariate values (e.g., clustering) or similar correlations between covariates (e.g., Gaussian mixture models). To solve the problem, we "synthetically sample" from a counterfactual distribution using higher-order multi-linear moments of the observable data. To understand how these "causal mixtures" fit in with more classical notions, we develop a hierarchy of mixture identifiability.
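
As a rough illustration of how higher-order moments can identify a mixture, the sketch below applies a textbook matrix pencil (Prony-style) step to a toy problem: given the first 2K moments of a treatment effect that takes K distinct values across latent classes, it recovers those values and the class proportions. This is not the authors' implementation. The function name `recover_mixture_from_moments` and the toy numbers are hypothetical, and the moments are assumed to be known exactly, whereas in practice they would have to be estimated from data.

```python
import numpy as np

def recover_mixture_from_moments(moments):
    """Recover K atoms and weights from moments m_0, ..., m_{2K-1},
    where m_k = sum_i w_i * x_i**k (hypothetical helper, illustration only)."""
    m = np.asarray(moments, dtype=float)
    K = len(m) // 2
    # Hankel matrices: H0[i, j] = m[i + j], H1[i, j] = m[i + j + 1]
    H0 = np.array([[m[i + j] for j in range(K)] for i in range(K)])
    H1 = np.array([[m[i + j + 1] for j in range(K)] for i in range(K)])
    # The eigenvalues of H0^{-1} H1 (the matrix pencil) are the atoms x_i
    atoms = np.sort(np.linalg.eigvals(np.linalg.solve(H0, H1)).real)
    # Weights solve the Vandermonde system V w = (m_0, ..., m_{K-1})
    V = np.vander(atoms, K, increasing=True).T  # V[k, i] = atoms[i]**k
    weights = np.linalg.solve(V, m[:K])
    return atoms, weights

# Toy example (assumed values): two latent classes with treatment
# effects 1.0 and 3.0 and class proportions 0.25 and 0.75.
true_effects = np.array([1.0, 3.0])
true_weights = np.array([0.25, 0.75])
moments = [float(np.sum(true_weights * true_effects**k)) for k in range(4)]

atoms, weights = recover_mixture_from_moments(moments)
average_effect = float(np.sum(weights * atoms))  # equals the first moment
```

In this toy setting the recovered atoms play the role of class-specific effects and their weighted average plays the role of an average effect. The step is sensitive to noise in the higher moments, so it should be read as a sketch of the idea rather than a practical estimator.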
