Marginal Causal Flows for Validation and Inference
Daniel de Vassimon Manela, Laura Battaglia, Robin J. Evans
TL;DR
Frugal Flows (FFs) address the challenge of validating causal inference methods by directly parameterising the marginal causal effect within a flexible, likelihood-based framework. By combining normalising flows for the data-generating process with copula-based conditioning, FFs learn the causal margin, i.e. the marginal distribution of the outcome under do(T), while allowing unobserved confounding and treatment effect heterogeneity to be specified exactly. The resulting synthetic benchmarks closely resemble real-world data yet encode user-defined causal properties, giving precise control over the average treatment effect (ATE), confounding, and overlap. This makes FFs a principled, configurable platform for validating and stress-testing causal inference methods in complex data settings. The benefits are demonstrated in simulated and real-data experiments, at the cost of higher data and tuning requirements.
Abstract
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging for two reasons: the models employed are inflexible, and causal benchmark datasets lack complexity, often failing to reproduce intricate real-world data patterns. In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative models to both learn flexible data representations and exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate these properties with experiments on both simulated and real-world datasets.
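To make the idea of a dataset that "exactly satisfies a user-defined average treatment effect" concrete, the following is a minimal sketch and not the authors' model. It assumes a single standard-normal covariate Z, a logistic propensity for a binary treatment T, and a Gaussian copula with correlation `rho` in place of the normalising flows; the names `simulate`, `naive_ate` and `adjusted_ate` are hypothetical. The outcome is drawn by pushing a copula rank through the inverse CDF of a chosen causal margin, here Y | do(T = t) ~ N(ate * t, 1), so the true ATE equals `ate` by construction whatever the strength of confounding.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def simulate(n, ate, rho, seed=0):
    """Draw (z, t, y) with causal margin Y | do(T=t) ~ N(ate * t, 1).

    Illustrative assumptions: one covariate, logistic propensity,
    Gaussian copula (correlation rho) linking the outcome rank to Z.
    """
    rng = random.Random(seed)
    margins = {0: NormalDist(0.0, 1.0), 1: NormalDist(ate, 1.0)}
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # Treatment depends on z, so z confounds the T -> Y relationship.
        t = 1 if rng.random() < 1.0 / (1.0 + math.exp(-z)) else 0
        # Latent score is N(0, 1) marginally and correlated with z.
        latent = rho * z + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        u = STD_NORMAL.cdf(latent)      # copula rank in (0, 1)
        y = margins[t].inv_cdf(u)       # quantile of the chosen causal margin
        rows.append((z, t, y))
    return rows

def naive_ate(rows):
    """Unadjusted difference in mean outcomes between treatment arms."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return fmean(treated) - fmean(control)

def _cov(a, b):
    ma, mb = fmean(a), fmean(b)
    return fmean((x - ma) * (y - mb) for x, y in zip(a, b))

def adjusted_ate(rows):
    """Coefficient on t from an OLS regression of y on t and z."""
    z = [r[0] for r in rows]
    t = [float(r[1]) for r in rows]
    y = [r[2] for r in rows]
    s_zz, s_tt, s_tz = _cov(z, z), _cov(t, t), _cov(t, z)
    return (_cov(t, y) * s_zz - s_tz * _cov(z, y)) / (s_tt * s_zz - s_tz**2)

rows = simulate(50_000, ate=1.0, rho=0.6, seed=0)
print("naive difference in means:", round(naive_ate(rows), 3))
print("covariate-adjusted estimate:", round(adjusted_ate(rows), 3))
```

In this toy setting the unadjusted difference in means is biased by confounding, while the covariate-adjusted estimate should land close to the specified ATE. This is the kind of check a benchmark with a known ground-truth effect makes possible.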
