Do-PFN: In-Context Learning for Causal Effect Estimation

Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf

TL;DR

We address causal effect estimation from observational data without requiring a known causal graph or unconfoundedness. Do-PFN pre-trains a transformer-based PFN on synthetic SCMs to perform in-context learning for predicting conditional interventional distributions and CATEs from observational data, effectively learning to adjust for causal structure during inference. The paper provides theoretical results on optimal CID approximation under the SCM prior, analyzes sources of uncertainty, and demonstrates strong empirical performance across six synthetic case studies, RealCause, and known-graph datasets, with robust uncertainty calibration and favorable inference speed. This approach broadens access to causal-effect estimation by leveraging synthetic priors and amortized inference, offering a practical tool that remains competitive when traditional assumptions fail and scales to moderately complex causal graphs. Overall, Do-PFN shows promise as a general-purpose, efficient backbone for causal inference in tabular settings.

Abstract

Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task require interventional data or knowledge of the ground-truth causal graph, or rely on assumptions such as unconfoundedness, which restricts their applicability in real-world settings. In the domain of tabular machine learning, prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN's scalability and robustness across datasets with a variety of causal characteristics.

Paper Structure

This paper contains 69 sections, 3 theorems, 34 equations, 22 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Performing stochastic gradient descent according to Algorithm 1 (data generation) corresponds to minimizing the expected forward Kullback-Leibler divergence between the conditional interventional distribution $p(y^{in} \mid \mathbf{x}^{in}, do(t^{in}), \psi)$ and the distribution $q_\theta(y^{in} \mid do(t^{in}), \ldots)$ … Here, the expectation is taken with respect to the data-generating distribution defined in Algorithm …
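The link between the training loss and the forward KL divergence can be written out explicitly. The statement above is truncated in this summary, so the following is a sketch under an assumption: that $q_\theta$ conditions on $do(t^{in})$, $\mathbf{x}^{in}$, and the observational dataset $\mathcal{D}^{ob}$, as the Figure 1 caption suggests. With $y^{in}$ drawn from the true conditional interventional distribution, the standard decomposition of the KL divergence gives:

```latex
\begin{aligned}
&\mathbb{E}\left[\mathrm{KL}\left(
    p(y^{in} \mid \mathbf{x}^{in}, do(t^{in}), \psi)
    \,\big\|\,
    q_\theta(y^{in} \mid do(t^{in}), \mathbf{x}^{in}, \mathcal{D}^{ob})
\right)\right] \\
&\quad = -\,\mathbb{E}\left[\log q_\theta(y^{in} \mid do(t^{in}), \mathbf{x}^{in}, \mathcal{D}^{ob})\right] + C,
\qquad
C = \mathbb{E}\left[\log p(y^{in} \mid \mathbf{x}^{in}, do(t^{in}), \psi)\right].
\end{aligned}
```

Because $C$ does not depend on $\theta$, minimizing the expected negative log-likelihood of the interventional outcomes is equivalent to minimizing the expected forward KL divergence.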

Figures (22)

  • Figure 1: Do-PFN overview: Do-PFN performs in-context learning (ICL) for causal effect estimation, predicting conditional interventional distributions (CIDs) based on observational data alone. In pre-training, a large number of structural causal models (SCMs) are sampled. For each SCM, we sample an entire dataset of $M^{ob}$ observational data points ${\mathcal{D}^{ob} = \{(t^{ob}_j, \mathbf{x}^{ob}_j, y^{ob}_j)\}_{j=1}^{M^{ob}}}$. We also sample $M^{in}$ interventional data points ${\mathcal{D}^{in} = \{(t^{in}_k, \mathbf{x}^{in}_k, y^{in}_k)\}_{k=1}^{M^{in}}}$. To simulate inference, we input $(t^{in}, \mathbf{x}^{in})$ along with the entire observational dataset $\mathcal{D}^{ob}$, which can have various sizes and dimensionalities. Subsequently, the transformer makes predictions $\hat{y}$, and we calculate the pre-training loss $L(\hat{y}, y^{in})$ between the predictions $\hat{y}$ and the ground-truth interventional outcomes $y^{in}$. Pre-training repeats this procedure across millions of sampled SCMs to meta-learn how to perform causal inference in context. In applications, Do-PFN leverages the many simulated interventions it has seen during pre-training to predict CIDs, relying only on observational data and requiring no information about the causal graph.
  • Figure 2: Case studies: Visualization of the graph structures of our six causal case studies, requiring Do-PFN to automatically perform adjustment based on the front-door and back-door criteria. Treatment variables $t$ are visualized in orange, covariates $\mathbf{x}$ in red, and outcomes $y$ in blue. Gray variables represent unobservables, which are not shown to any of the methods yet influence the generated data.
  • Figure 3: Results on synthetic data: Performance of Do-PFN in estimating conditional interventional distributions (CIDs, first row), conditional average treatment effects (CATEs, second row), and average treatment effects (ATEs, third row). Do-PFN provides strong performance across tasks.
  • Figure 4: Out-of-distribution: Analysis of Do-PFN's performance on 500 in-distribution datasets (IOD) compared to various OOD settings. Do-PFN is robust to different noise distributions (left) and various forms of functional non-linearity (middle). The performance of Do-PFN (v1) deteriorates on larger graph sizes (right); Do-PFN (v1.1) recovers it via larger-scale pre-training.
  • Figure 5: Results on RealCause: Performance of Do-PFN and our causal baselines in conditional average treatment effect (CATE) estimation on the RealCause benchmark. Do-PFN provides competitive performance in these semi-synthetic, unconfounded settings.
  • ...and 17 more figures
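
The data-generation step described in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's prior or code: the `LinearSCM` class, the uniform weight prior, the single covariate, and all function names are hypothetical choices made for illustration. It only shows the idea of drawing one SCM, sampling an observational dataset $\mathcal{D}^{ob}$ from it, and sampling interventional points $\mathcal{D}^{in}$ from the same mechanisms with the treatment set by $do(t)$. The transformer and the loss $L(\hat{y}, y^{in})$ are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class LinearSCM:
    """Toy SCM (illustrative assumption): a hidden confounder u and an
    observed covariate x both influence treatment t and outcome y."""
    w_xt: float  # effect of x on t
    w_ut: float  # effect of u on t
    w_ty: float  # causal effect of t on y
    w_xy: float  # effect of x on y
    w_uy: float  # effect of u on y
    noise_std: float = 0.1


def sample_scm(rng):
    """Draw one SCM from a hypothetical uniform prior over linear weights."""
    return LinearSCM(*[rng.uniform(-2.0, 2.0) for _ in range(5)])


def sample_point(scm, rng, do_t=None):
    """Sample one (t, x, y) triple. If do_t is given, t is fixed by
    intervention instead of being generated from its parents x and u."""
    u = rng.gauss(0.0, 1.0)  # unobserved: used here but never returned
    x = rng.gauss(0.0, 1.0)
    if do_t is None:
        t = scm.w_xt * x + scm.w_ut * u + rng.gauss(0.0, scm.noise_std)
    else:
        t = do_t
    y = (scm.w_ty * t + scm.w_xy * x + scm.w_uy * u
         + rng.gauss(0.0, scm.noise_std))
    return t, x, y


def sample_pretraining_task(rng, m_ob=100, m_in=20):
    """One pre-training task: an SCM, its observational dataset D_ob, and
    interventional points D_in whose y values serve as training targets."""
    scm = sample_scm(rng)
    d_ob = [sample_point(scm, rng) for _ in range(m_ob)]
    d_in = [sample_point(scm, rng, do_t=rng.uniform(-1.0, 1.0))
            for _ in range(m_in)]
    return scm, d_ob, d_in


if __name__ == "__main__":
    rng = random.Random(0)
    scm, d_ob, d_in = sample_pretraining_task(rng)
    print(len(d_ob), "observational points,", len(d_in), "interventional points")
```

In a full pre-training loop, a model would receive $\mathcal{D}^{ob}$ together with each $(t^{in}, \mathbf{x}^{in})$ and be scored against the corresponding $y^{in}$. Because `u` affects both `t` and `y` in the observational branch but not `t` in the interventional branch, the two datasets generally imply different relationships between treatment and outcome, which is the gap the model has to learn to bridge.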

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3: Consistency of Do-PFN
  • proof