CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan

TL;DR

CausalPFN introduces a transformer that amortizes causal effect estimation by learning from a large library of simulated data-generating processes (DGPs) that satisfy the ignorability assumption. For new observational data, it produces posterior predictive distributions (PPDs) over conditional expected potential outcomes (CEPOs). Trained with a causal data-prior loss, a single model $q_\theta$ maps observed data to CEPO-PPDs, enabling zero-shot estimation of the conditional average treatment effect (CATE) and the average treatment effect (ATE) with calibrated uncertainty. The approach delivers state-of-the-art average performance on CATE across IHDP, ACIC, and Lalonde, competitive ATE results, and competitive uplift modeling, while shifting the heavy posterior computation to pre-training. The authors formalize identifiability conditions that ensure consistency, provide calibration mechanisms (e.g., temperature scaling) to address epistemic uncertainty, and release code and priors to foster adoption in practice.
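To make the step from CEPOs to treatment effects concrete, here is a minimal sketch assuming a binary treatment, where the CATE for a unit is the difference of its two CEPO point estimates and the ATE is the average of the CATEs. The function names and the numbers are hypothetical illustrations, not part of the CausalPFN package or its results.

```python
from statistics import mean

def cate_from_cepos(mu0, mu1):
    """Per-unit CATE: CEPO under treatment minus CEPO under control."""
    return [m1 - m0 for m0, m1 in zip(mu0, mu1)]

def ate_from_cepos(mu0, mu1):
    """ATE: average of the per-unit CATEs."""
    return mean(cate_from_cepos(mu0, mu1))

# Hypothetical CEPO point estimates for three units (illustrative values only).
mu0_hat = [1.0, 2.0, 0.5]   # estimated E[Y_0 | X = x_i]
mu1_hat = [1.5, 2.0, 2.0]   # estimated E[Y_1 | X = x_i]

cate_hat = cate_from_cepos(mu0_hat, mu1_hat)
ate_hat = ate_from_cepos(mu0_hat, mu1_hat)
```

In this sketch the point estimates would come from the CEPO-PPD means; uncertainty would require working with the full predictive distributions rather than their means.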

Abstract

Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out of the box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model requires no further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN/).
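As an illustration of what a simulated data-generating process satisfying ignorability can look like, the sketch below draws a single covariate, assigns treatment with a propensity that depends only on that covariate, and reveals one of two noisy potential outcomes. The functional forms, noise levels, and the name `simulate_dgp` are assumptions chosen for readability; they are not the priors used by the authors.

```python
import math
import random

def simulate_dgp(n, seed=0):
    """Simulate n observational rows from one toy DGP that satisfies ignorability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                    # covariate X
        propensity = 1.0 / (1.0 + math.exp(-x))    # P(T = 1 | X = x)
        # Treatment depends on x and independent randomness only, so it is
        # independent of the potential outcomes given X (ignorability).
        t = 1 if rng.random() < propensity else 0
        mu0 = math.sin(x)                          # CEPO under control
        mu1 = math.sin(x) + 1.0 + 0.5 * x          # CEPO under treatment
        y0 = mu0 + rng.gauss(0.0, 0.1)             # potential outcomes with
        y1 = mu1 + rng.gauss(0.0, 0.1)             # zero-mean noise
        y = y1 if t == 1 else y0                   # only one outcome is observed
        rows.append({"x": x, "t": t, "y": y, "mu0": mu0, "mu1": mu1})
    return rows

data = simulate_dgp(200, seed=1)
observational = [(r["x"], r["t"], r["y"]) for r in data]  # what a model would see
```

Because the simulator knows `mu0` and `mu1`, it can supply ground-truth targets for training, which is exactly what real observational data lacks.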

Paper Structure

This paper contains 22 sections, 4 theorems, 40 equations, 14 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

Under mild regularity assumptions (the measurability and integrability assumptions in the theory appendix), for almost all $\psi^\star \sim \pi$ and any set of i.i.d. samples $\mathcal{D}_\mathrm{obs}\sim P^{\psi^\star}_\mathrm{obs}$, we have that as $|\mathcal{D}_{\mathrm{obs}}|\to\infty$, if and only if the prior $\pi$ is CEPO-identifiable, that is, for almost all $\psi \sim \pi$, the CEPOs $\mu_t\mlef

Figures (14)

  • Figure 1: Time vs. Performance. Comparison across 310 causal inference tasks from IHDP, ACIC, and Lalonde. CausalPFN achieves the best average rank (by precision in estimation of heterogeneous effect) while being much faster in wall-clock time from data to estimates.
  • Figure 2: Traditional Causal Inference vs. CausalPFN. (Left): A domain expert selects or tunes an estimator for a DGP that they deem appropriate for the given data. (Right): The domain expert simulates diverse DGPs for pre-training, and a transformer learns to amortize causal inference automatically.
  • Figure 3: Causal Data-Prior Training. At each iteration an index $\psi_i \sim \pi$ is sampled (left), yielding the DGP $P^{\psi_i}(\mathbf{X}, T, \{Y_t\}_{t\in\mathcal{T}}, Y)$. From this DGP we simulate an observational context $\mathcal{D}_\mathrm{obs}$ and a query $(\mathbf{x}, t)$ with its true $\mu_t(\mathbf{x}; \psi_i)$ (center). Passing $(\mathbf{x}, t, \mathcal{D}_\mathrm{obs})$ through the transformer predicts the CEPO-PPD $q_\theta(\cdot \mid \mathbf{x}, t, \mathcal{D}_\mathrm{obs})$ (in yellow), which is derived from an implicit posterior $\pi(\cdot \mid \mathcal{D}_\mathrm{obs})$ that is never explicitly computed (right). We train $\theta$ to minimize the causal data-prior loss (bottom).
  • Figure 4: Prior construction. Sample diverse base tables (OpenML or synthetic TabPFN), select covariates $\mathbf{X}$, draw treatment $T$ with a random propensity model, select columns $\mu_0, \mu_1$, and add zero-mean noise to form $Y_0$, $Y_1$, and $Y$.
  • Figure 5: Architecture, Training, and Inference Details. (Left): An observational dataset and a batch of queries, along with their true CEPO values, are sampled from the prior. Each observational row forms a context token, while query tokens consist of only the treatment and covariates. (Middle): The context and query tokens are fed into a transformer encoder with an asymmetric attention mask, where both context and query tokens attend only to the context tokens. (Bottom-Right): The output tokens are projected into a 1024-dimensional logit vector and softmaxed to form a discretized CEPO-PPD. The true CEPO value corresponding to each output token is then smoothed with a narrow-width Gaussian, and training minimizes the cross-entropy (histogram) loss. (Top-Right): At inference time, the CEPO-PPD mean is used as the point estimate.
  • ...and 9 more figures
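
The mechanics described in the Figure 5 caption can be sketched in plain Python: an attention mask in which every token attends only to context tokens, a Gaussian-smoothed target over discretized bins, the cross-entropy (histogram) loss, and the PPD mean used as a point estimate. This is a toy sketch under stated assumptions, not the authors' implementation: it uses 5 bins instead of 1024 for readability, and the bin centers, the smoothing width `sigma`, and all function names are hypothetical.

```python
import math

def attention_mask(n_context, n_query):
    """mask[i][j] is True when token i may attend to token j.
    Tokens are ordered context first, then queries; all attend to context only."""
    n = n_context + n_query
    return [[j < n_context for j in range(n)] for _ in range(n)]

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_target(value, centers, sigma):
    """Soft label: a narrow Gaussian around the true CEPO, normalized over bins.
    Assumes `value` lies close enough to some bin center to avoid underflow."""
    weights = [math.exp(-0.5 * ((c - value) / sigma) ** 2) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

def histogram_loss(logits, value, centers, sigma):
    """Cross-entropy between the smoothed target and the predicted bin probabilities."""
    probs = softmax(logits)
    target = smoothed_target(value, centers, sigma)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

def ppd_mean(logits, centers):
    """Point estimate: mean of the discretized predictive distribution."""
    return sum(p * c for p, c in zip(softmax(logits), centers))

centers = [-1.0, -0.5, 0.0, 0.5, 1.0]    # toy bin centers (assumed)
uniform_logits = [0.0] * len(centers)
peaked_logits = [0.0, 0.0, 0.0, 5.0, 0.0]

mask = attention_mask(n_context=2, n_query=1)
loss_uniform = histogram_loss(uniform_logits, 0.5, centers, sigma=0.1)
loss_peaked = histogram_loss(peaked_logits, 0.5, centers, sigma=0.1)
estimate = ppd_mean(peaked_logits, centers)
```

With the true value at 0.5, logits peaked on the matching bin give a lower loss than uniform logits, which is the behavior the training objective rewards.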

Theorems & Definitions (11)

  • Definition 1: CEPO-Identifiability
  • Definition 2: CEPO-PPD
  • Proposition 1: Informal
  • Definition 3: Causal Data-Prior Loss
  • Definition 4: Observational Quotient Space
  • Theorem 2: Corollary of Doob's Consistency Theorem
  • Lemma 3
  • proof
  • Definition 5
  • Proposition 4
  • ...and 1 more