Transformers Handle Endogeneity in In-Context Linear Regression

Haodong Liang; Krishnakumar Balasubramanian; Lifeng Lai

TL;DR

This work investigates whether transformers can handle endogeneity in in-context linear regression by leveraging instrumental variables. The authors show that looped transformer architectures can implement a bi-level gradient-descent procedure that converges exponentially to the 2SLS solution and provide a theoretical excess-loss bound for in-context pretraining. Empirically, the pretrained transformer achieves performance on par with 2SLS in standard IV tasks and surpasses it under weak instruments or non-standard IV scenarios, including multicollinearity and non-linear IV effects. The results support using in-context pretraining as a robust tool for endogeneity-aware predictions and coefficient estimation, with potential real-world impact in causal inference tasks where IVs are imperfect or non-linear.

Abstract

We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares $(\textsf{2SLS})$ solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the $\textsf{2SLS}$ method, in the presence of endogeneity.
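The endogeneity problem and the 2SLS baseline mentioned in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experimental setup: the data-generating process, the dimensions, the noise scales, and the function names (`simulate_endogenous_data`, `ols`, `two_stage_least_squares`) are all assumptions chosen for illustration. It uses the standard textbook form of 2SLS: regress the endogenous variables on the instruments, then regress the outcome on the fitted values.

```python
import numpy as np

def simulate_endogenous_data(n, beta, theta, rng):
    """Draw (Z, X, y) where noise in X also enters y, so X is endogenous.

    Hypothetical data-generating process, assumed for illustration only.
    """
    q, p = theta.shape
    Z = rng.normal(size=(n, q))   # instruments, independent of all noise terms
    e = rng.normal(size=(n, p))   # shared noise: the source of endogeneity
    X = Z @ theta + e             # first stage: X depends on Z and on e
    eps = e.sum(axis=1) + 0.1 * rng.normal(size=n)  # outcome noise, correlated with X
    y = X @ beta + eps
    return Z, X, y

def ols(X, y):
    """Ordinary least squares; biased here because X is correlated with eps."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(Z, X, y):
    """Textbook 2SLS: project X onto the instruments, then regress y on the fit."""
    theta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]  # stage 1: X on Z
    X_hat = Z @ theta_hat                             # fitted endogenous variables
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]   # stage 2: y on X_hat

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])
theta_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 instruments, 2 regressors

Z, X, y = simulate_endogenous_data(5000, beta_true, theta_true, rng)
err_ols = np.linalg.norm(ols(X, y) - beta_true)
err_2sls = np.linalg.norm(two_stage_least_squares(Z, X, y) - beta_true)
print(f"OLS coefficient error:  {err_ols:.3f}")
print(f"2SLS coefficient error: {err_2sls:.3f}")
```

In this assumed setup the instruments are strong and independent of the noise, so 2SLS should land much closer to the true coefficients than OLS. The weak-instrument and non-linear cases discussed in the TL;DR are not covered by this sketch.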

Paper Structure (27 sections, 11 theorems, 157 equations, 8 figures, 2 algorithms)

Introduction
Related works
Endogeneity and Instrumental Variable Regression
Transformers Handle Endogeneity
Transformer Architecture
Gradient descent based IV regression
Transformers Can Efficiently Implement GD-2SLS
Pretraining and Excess Loss Bound
Extracting the regression coefficients
Experiments
Experiment Setup
Results
Conclusion
Proofs for Section \ref{subsec:IV Regression}
Proof of Theorem \ref{thm:consistency}
...and 12 more sections

Key Result

Theorem 2.1

Under the IV assumption and the regularity assumption, consider the clipping operation [...]. When [...], where $K:=\frac{\lambda_{\min}(\boldsymbol{\Sigma}_z)}{6B_z^2}$ and $K_0:=\frac{\lambda_{\min}(\boldsymbol{\Sigma}_z)\sigma_{\min}^2(\boldsymbol{\Theta})}{2B_{\epsilon_2}^2}$, the mean squared error of the 2SLS estimate is bounded by [...], where $\boldsymbol{\Sigma}_z:=\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^\top]$ and [...]

Figures (8)

Figure 1: The ICL performance of the trained transformer model in endogeneity tasks. We compare in-context prediction error (ICPE) and coefficient MSE versus (a) the number of in-context samples; (b) the IV strength. The curves are averaged over 500 simulations.
Figure 2: The ICL performance of the trained transformer model in non-standard endogeneity tasks: (a) the IV has a quadratic effect on the endogenous variable; (b) the dimension of the IV is not sufficient to identify the endogenous variable. The curves are averaged over 500 simulations.
Figure 3: The convergence of the GD-based 2SLS method with (a) fixed $\alpha=0.0012$ and varying $\eta$ and (b) fixed $\eta=0.01$ and varying $\alpha$.
Figure 4: The convergence of the GD-based 2SLS method with $\alpha^\star=\frac{1}{\sigma_{\max}^2(\boldsymbol{Z\hat{\Theta}})}$ and $\eta^\star=\frac{1}{\sigma_{\max}^2(\boldsymbol{Z})}$. The biases of 2SLS estimator and OLS estimator at $n=150$ are plotted for comparison.
Figure 5: The ICL performance of the trained transformer model in endogeneity tasks with multicollinearity. (a) 1 collinear column in $\boldsymbol{X}$, and 1 collinear column in $\boldsymbol{Z}$. Note that the coefficient MSEs for $\textsf{2SLS}$ and $\textsf{OLS}$ are both out of range. (b) 2 collinear columns in $\boldsymbol{X}$, and 5 collinear columns in $\boldsymbol{Z}$. We compare the performance to the $\ell _2$-regularized $\textsf{2SLS}$ and $\textsf{OLS}$ estimators. The curves are averaged over 500 simulations.
...and 3 more figures
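
Figures 3 and 4 refer to a gradient-descent-based 2SLS method with two step sizes, $\alpha$ and $\eta$. The following is a minimal sketch of one plausible form of such a bi-level procedure, not the authors' algorithm: the update rules, the zero initialization, the iteration count, and the name `gd_2sls` are assumptions. Only the step-size choices $\alpha^\star=1/\sigma_{\max}^2(\boldsymbol{Z\hat{\Theta}})$ and $\eta^\star=1/\sigma_{\max}^2(\boldsymbol{Z})$ are taken from the Figure 4 caption. The inner update runs gradient descent on the first-stage loss $\frac{1}{2}\|\boldsymbol{Z\Theta}-\boldsymbol{X}\|_F^2$, and the outer update runs gradient descent on $\frac{1}{2}\|\boldsymbol{Z\Theta\beta}-\boldsymbol{y}\|^2$ using the current $\boldsymbol{\Theta}$.

```python
import numpy as np

def gd_2sls(Z, X, y, num_steps=500):
    """Bi-level gradient descent toward the 2SLS solution (assumed update form).

    Step sizes follow the Figure 4 caption; everything else is illustrative.
    """
    q, p = Z.shape[1], X.shape[1]
    # The caption's alpha uses the closed-form first-stage estimate Theta-hat.
    theta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    eta = 1.0 / np.linalg.norm(Z, 2) ** 2                # 1 / sigma_max(Z)^2
    alpha = 1.0 / np.linalg.norm(Z @ theta_hat, 2) ** 2  # 1 / sigma_max(Z Theta-hat)^2

    theta = np.zeros((q, p))  # first-stage coefficients
    beta = np.zeros(p)        # second-stage coefficients
    for _ in range(num_steps):
        X_fit = Z @ theta                                    # current fitted regressors
        beta = beta - alpha * X_fit.T @ (X_fit @ beta - y)   # outer step on beta
        theta = theta - eta * Z.T @ (X_fit - X)              # inner step on Theta
    return beta, theta

# Hypothetical data, assumed for illustration only.
rng = np.random.default_rng(1)
n = 500
theta_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
beta_true = np.array([1.0, -2.0])
Z = rng.normal(size=(n, 3))
e = rng.normal(size=(n, 2))                                  # noise shared by X and y
X = Z @ theta_true + e
y = X @ beta_true + e.sum(axis=1) + 0.1 * rng.normal(size=n)

beta_gd, theta_gd = gd_2sls(Z, X, y)

# Closed-form 2SLS for comparison.
theta_closed = np.linalg.lstsq(Z, X, rcond=None)[0]
beta_closed = np.linalg.lstsq(Z @ theta_closed, y, rcond=None)[0]
gap = np.linalg.norm(beta_gd - beta_closed)
print(f"distance between GD iterate and closed-form 2SLS: {gap:.2e}")
```

With well-conditioned instruments, as assumed here, both updates contract quickly and the iterate approaches the closed-form 2SLS estimate. Convergence would be slower for ill-conditioned $\boldsymbol{Z}$ or $\boldsymbol{Z\hat{\Theta}}$.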

Theorems & Definitions (26)

Definition 2.1: 2SLS estimator
Theorem 2.1: MSE of 2SLS estimator
Remark 2.1
Definition 3.1: Attention layer
Definition 3.2: MLP layer
Definition 3.3: Transformer
Definition 3.4: Looped transformer
Theorem 3.1: Implementing 2SLS with gradient-based method
Theorem 3.2: Implement a step of GD-2SLS with a transformer block
Corollary 3.1: Implementing GD-2SLS with looped transformer
...and 16 more
