Transformers Handle Endogeneity in In-Context Linear Regression

Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai

TL;DR

This work investigates whether transformers can handle endogeneity in in-context linear regression by leveraging instrumental variables. The authors show that looped transformer architectures can implement a bi-level gradient-descent procedure that converges exponentially to the 2SLS solution and provide a theoretical excess-loss bound for in-context pretraining. Empirically, the pretrained transformer achieves performance on par with 2SLS in standard IV tasks and surpasses it under weak instruments or non-standard IV scenarios, including multicollinearity and non-linear IV effects. The results support using in-context pretraining as a robust tool for endogeneity-aware predictions and coefficient estimation, with potential real-world impact in causal inference tasks where IVs are imperfect or non-linear.

Abstract

We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares $(\textsf{2SLS})$ solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the $\textsf{2SLS}$ method, in the presence of endogeneity.
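The contrast between ordinary least squares and 2SLS under endogeneity can be illustrated with a small simulation. This is a minimal sketch, not the authors' experimental setup: the data-generating process, the shared confounder `u`, the correlation level `rho`, and all function names are assumptions chosen for illustration. It uses the standard textbook two-stage form (regress $X$ on $Z$, then regress $y$ on the fitted values).

```python
import numpy as np

def simulate_iv_data(n, beta, theta, rho=0.8, seed=0):
    """Draw (Z, X, y) where X is endogenous (hypothetical setup for illustration)."""
    rng = np.random.default_rng(seed)
    q, p = theta.shape
    Z = rng.normal(size=(n, q))                  # instruments
    u = rng.normal(size=(n, 1))                  # shared confounder
    # Noise in X and noise in y both load on u, so X is correlated with the y-noise.
    eps_x = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=(n, p))
    eps_y = rho * u[:, 0] + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = Z @ theta + eps_x
    y = X @ beta + eps_y
    return Z, X, y

def ols(X, y):
    """Ordinary least squares: biased when X is endogenous."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(Z, X, y):
    """Textbook 2SLS: stage 1 regresses X on Z, stage 2 regresses y on fitted X."""
    theta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]   # shape (q, p)
    X_hat = Z @ theta_hat
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

beta = np.array([1.0, -2.0])
theta = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Z, X, y = simulate_iv_data(5000, beta, theta)

beta_ols = ols(X, y)
beta_2sls = two_stage_least_squares(Z, X, y)
print("OLS error :", np.linalg.norm(beta_ols - beta))
print("2SLS error:", np.linalg.norm(beta_2sls - beta))
```

In this setup the OLS estimate stays biased no matter how many samples are drawn, while the 2SLS error shrinks as the sample size grows, which is the baseline behavior the trained transformer is compared against.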

Paper Structure

This paper contains 27 sections, 11 theorems, 157 equations, 8 figures, 2 algorithms.

Key Result

Theorem 2.1

Under the IV assumption and the regularity assumption, consider the clipping operation [...]. When [...], where $K:=\frac{\lambda_{\min}(\boldsymbol{\Sigma}_z)}{6B_z^2}$ and $K_0:=\frac{\lambda_{\min}(\boldsymbol{\Sigma}_z)\sigma_{\min}^2(\boldsymbol{\Theta})}{2B_{\epsilon_2}^2}$, the mean squared error of the 2SLS estimate is bounded by [...], where $\boldsymbol{\Sigma}_z:=\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^\top]$ and [...].

Figures (8)

  • Figure 1: The ICL performance of the trained transformer model in endogeneity tasks. We compare in-context prediction error (ICPE) and coefficient MSE versus (a) the number of in-context samples; (b) the IV strength. The curves are averaged over 500 simulations.
  • Figure 2: The ICL performance of the trained transformer model in non-standard endogeneity tasks: (a) the IV has a quadratic effect on the endogenous variable; (b) the dimension of the IV is insufficient to identify the endogenous variable. The curves are averaged over 500 simulations.
  • Figure 3: The convergence of the GD-based 2SLS method with (a) fixed $\alpha=0.0012$ and varying $\eta$ and (b) fixed $\eta=0.01$ and varying $\alpha$.
  • Figure 4: The convergence of the GD-based 2SLS method with $\alpha^\star=\frac{1}{\sigma_{\max}^2(\boldsymbol{Z}\hat{\boldsymbol{\Theta}})}$ and $\eta^\star=\frac{1}{\sigma_{\max}^2(\boldsymbol{Z})}$. The biases of the 2SLS and OLS estimators at $n=150$ are plotted for comparison.
  • Figure 5: The ICL performance of the trained transformer model in endogeneity tasks with multicollinearity. (a) 1 collinear column in $\boldsymbol{X}$, and 1 collinear column in $\boldsymbol{Z}$. Note that the coefficient MSEs for $\textsf{2SLS}$ and $\textsf{OLS}$ are both out of range. (b) 2 collinear columns in $\boldsymbol{X}$, and 5 collinear columns in $\boldsymbol{Z}$. We compare the performance to the $\ell_2$-regularized $\textsf{2SLS}$ and $\textsf{OLS}$ estimators. The curves are averaged over 500 simulations.
  • ...and 3 more figures
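
The step sizes quoted in the Figure 4 caption can be tried in a small gradient-descent version of 2SLS. This is a sketch under stated assumptions: the update rule below (simultaneous gradient steps on the first-stage and second-stage least-squares objectives) is one plausible bi-level scheme and may differ from the paper's exact algorithm, and the simulated data, `gd_2sls`, and `closed_form_2sls` are hypothetical names introduced for illustration. Only the choices $\eta = 1/\sigma_{\max}^2(\boldsymbol{Z})$ and $\alpha = 1/\sigma_{\max}^2(\boldsymbol{Z}\hat{\boldsymbol{\Theta}})$ and the sample size $n=150$ come from the caption.

```python
import numpy as np

def closed_form_2sls(Z, X, y):
    """Textbook 2SLS, used here as the reference solution."""
    theta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
    beta_hat = np.linalg.lstsq(Z @ theta_hat, y, rcond=None)[0]
    return theta_hat, beta_hat

def gd_2sls(Z, X, y, eta, alpha, num_steps=500):
    """Assumed bi-level GD: theta tracks the first stage, beta the second stage."""
    q, p = Z.shape[1], X.shape[1]
    theta = np.zeros((q, p))
    beta = np.zeros(p)
    for _ in range(num_steps):
        X_hat = Z @ theta                          # current first-stage fit
        grad_theta = Z.T @ (X_hat - X)             # gradient of 0.5*||Z theta - X||_F^2
        grad_beta = X_hat.T @ (X_hat @ beta - y)   # gradient of 0.5*||X_hat beta - y||^2
        theta = theta - eta * grad_theta
        beta = beta - alpha * grad_beta
    return theta, beta

# Hypothetical endogenous data with n = 150 samples.
rng = np.random.default_rng(0)
n = 150
theta_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
beta_true = np.array([1.0, -2.0])
Z = rng.normal(size=(n, 3))
u = rng.normal(size=(n, 1))                        # confounder shared by X and y
X = Z @ theta_true + 0.8 * u + 0.6 * rng.normal(size=(n, 2))
y = X @ beta_true + 0.8 * u[:, 0] + 0.6 * rng.normal(size=n)

theta_hat, beta_closed = closed_form_2sls(Z, X, y)
# np.linalg.norm(A, 2) returns the largest singular value of a 2-D array.
eta = 1.0 / np.linalg.norm(Z, 2) ** 2
alpha = 1.0 / np.linalg.norm(Z @ theta_hat, 2) ** 2

theta_gd, beta_gd = gd_2sls(Z, X, y, eta, alpha)
print("max |beta_gd - beta_2sls|:", np.max(np.abs(beta_gd - beta_closed)))
```

Note that computing `alpha` here uses the closed-form first-stage estimate, which is a convenience for the sketch; in practice any step size below that threshold would serve the same purpose.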

Theorems & Definitions (26)

  • Definition 2.1: 2SLS estimator
  • Theorem 2.1: MSE of 2SLS estimator
  • Remark 2.1
  • Definition 3.1: Attention layer
  • Definition 3.2: MLP layer
  • Definition 3.3: Transformer
  • Definition 3.4: Looped transformer
  • Theorem 3.1: Implementing 2SLS with gradient-based method
  • Theorem 3.2: Implement a step of GD-2SLS with a transformer block
  • Corollary 3.1: Implementing GD-2SLS with looped transformer
  • ...and 16 more