Towards Causal Foundation Model: on Duality between Causal Inference and Attention

Jiaqi Zhang; Joel Jennings; Agrin Hilmkil; Nick Pawlowski; Cheng Zhang; Chao Ma

Towards Causal Foundation Model: on Duality between Causal Inference and Attention

Jiaqi Zhang, Joel Jennings, Agrin Hilmkil, Nick Pawlowski, Cheng Zhang, Chao Ma

TL;DR

The paper addresses the challenge of causal inference within foundation-model frameworks by proposing CInA, a self-supervised method that uses multiple unlabeled datasets to learn how to estimate treatment effects. It establishes a primal-dual link between covariate balancing and self-attention, showing that optimal balancing weights can be recovered via a transformer’s final layer. A gradient-based algorithm enables zero-shot causal inference on unseen datasets after training on multiple data sources, with both single-dataset and multi-dataset variants demonstrated. Empirical results on synthetic and real-world data show competitive accuracy with traditional per-dataset methods and substantial gains in inference speed, suggesting CInA as a stepping stone toward causally-aware foundation models.

Abstract

Foundation models have brought changes to the landscape of machine learning, demonstrating sparks of human-level intelligence across a diverse array of tasks. However, a gap persists in complex tasks such as causal inference, primarily due to challenges associated with intricate reasoning steps and high numerical precision requirements. In this work, we take a first step towards building causally-aware foundation models for treatment effect estimations. We propose a novel, theoretically justified method called Causal Inference with Attention (CInA), which utilizes multiple unlabeled datasets to perform self-supervised causal learning, and subsequently enables zero-shot causal inference on unseen tasks with new data. This is based on our theoretical results that demonstrate the primal-dual connection between optimal covariate balancing and self-attention, facilitating zero-shot causal inference through the final layer of a trained transformer-type architecture. We demonstrate empirically that CInA effectively generalizes to out-of-distribution datasets and various real-world datasets, matching or even surpassing traditional per-dataset methodologies. These results provide compelling evidence that our method has the potential to serve as a stepping stone for the development of causal foundation models.

Towards Causal Foundation Model: on Duality between Causal Inference and Attention

TL;DR

Abstract

Paper Structure (31 sections, 3 theorems, 41 equations, 6 figures, 3 tables, 4 algorithms)

This paper contains 31 sections, 3 theorems, 41 equations, 6 figures, 3 tables, 4 algorithms.

Introduction
Related Works
Establishing Duality Between Causality and Attention
Adversarial Covariate Balancing as Dual SVM
Self-attention as Support Vector Expansion
Practical Algorithms Towards Causal Foundation Models
Gradient-based Optimal Balancing via Self-Attention
Zero-shot Causal Inference under Multi-dataset Setting
Computational Complexity
Experiments
Simulation Study A: fixed causal graph
Simulation Study B: Multiple Causal Graphs
Empirical Studies on Real-world Datasets
Discussion
Discussion on the Definition of (Causal) Foundation Models
...and 16 more sections

Key Result

Theorem 1

Under mild regularities on ${\bm{X}}$, learning a self-attention via gradient-based Algorithm alg:single-dataset recovers the optimal covariate balancing weight at the global minimum of the penalized hinge loss in Eq. (eq:loss-func).

Figures (6)

Figure 1: Attending to units instead of words. Values correspond to covariate balancing weights.
Figure 2: CInA (multi-dataset) forward pass.
Figure 3: MAE for Simulation A. CINA matches the best learning-based method DML; CINA (ZS) generalizes well in moderate settings.
Figure 4: MAEs for ER-5000. CINA and CINA (ZS) match the best reference method, where CINA (ZS-S) improves upon CINA (ZS) with additional supervised signals.
Figure 5: MAE for real-world datasets. CInA outperforms the majority of baselines in most cases: it achieves the best average ranking of 1.83, whereas the second-best is DML with an average ranking of 3. CInA (ZS) generalizes well and returns the best result for ACIC.
...and 1 more figures

Theorems & Definitions (6)

Theorem 1: Duality between covariate balancing and self-attention
Lemma 1
proof
Theorem : Theorem 1
proof
Remark 1

Towards Causal Foundation Model: on Duality between Causal Inference and Attention

TL;DR

Abstract

Towards Causal Foundation Model: on Duality between Causal Inference and Attention

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (6)

Theorems & Definitions (6)