Transformers Learn Robust In-Context Regression under Distributional Uncertainty

Hoang T. H. Cao, Hai D. V. Trinh, Tho Quan, Lan V. Truong

Abstract

Recent work has shown that Transformers can perform in-context learning for linear regression under restrictive assumptions, including i.i.d. data, Gaussian noise, and Gaussian regression coefficients. However, real-world data often violate these assumptions: the distributions of inputs, noise, and coefficients are typically unknown, non-Gaussian, and may exhibit dependency across the prompt. This raises a fundamental question: can Transformers learn effectively in-context under realistic distributional uncertainty? We study in-context learning for noisy linear regression under a broad range of distributional shifts, including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. We compare Transformers against classical baselines that are optimal or suboptimal under the corresponding maximum-likelihood criteria. Across all settings, Transformers consistently match or outperform these baselines, demonstrating robust in-context adaptation beyond classical estimators.

Paper Structure (64 sections, 38 equations, 10 figures)

Figures (10)

  • Figure 1: In-context prediction error under different coefficient priors in the noiseless setting. All methods are trained with $\ell_2$ loss. Top row: exponential (left) and Laplace (right) coefficient distributions. Bottom row: coefficients sampled uniformly from the unit hypersphere. We compare Transformer-based in-context learning against OLS, Ridge, and $\ell_1$-based solvers (LP and ADMM). Errors are reported as normalized excess loss relative to a curriculum-dependent baseline.
  • Figure 2: Effect of feature distribution shifts on in-context learning. Gamma features induce skewed, non-negative marginals while preserving independence across context steps. VAR(1) features introduce temporal dependence across the prompt through an autoregressive structure.
  • Figure 3: Robustness of in-context learning under non-Gaussian and heavy-tailed noise distributions. All models in this figure are trained and evaluated using $\ell_1$ loss. We compare Transformer-based in-context learning with classical estimators, including least squares, Ridge regression, and $\ell_1$ solvers (LP and ADMM). For Bernoulli, exponential, Gamma, and Poisson noise, the $\ell_1$ objective corresponds to the maximum-likelihood estimator. The Student-$t$ case ($\nu = 2$) is shown separately, as it admits no finite-variance likelihood and falls outside the classical ML framework.
  • Figure 4: Extension of Figure 2 (Gamma-distributed features). In-context learning performance across different Gamma shape and rate parameters.
  • Figure 5: Extension of Figure 2 (VAR(1) features). Effect of increasing temporal correlation strength across the prompt.
  • ...and 5 more figures
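
The captions above describe prompts with non-Gaussian coefficients, temporally dependent VAR(1) features, and classical baselines such as OLS and Ridge. The following is a minimal sketch of that kind of setup, not the authors' code. The function names `sample_prompt` and `ridge_fit`, the diagonal VAR(1) form, the Laplace noise, and every parameter value are assumptions chosen for illustration. It uses only the Python standard library.

```python
import math
import random


def sample_prompt(n, d, rho=0.5, noise_scale=0.1, seed=0):
    """Draw one prompt: n (x, y) pairs sharing a single coefficient vector w.

    Assumptions (not taken from the paper): Laplace(0, 1) coefficients, a
    diagonal VAR(1) feature process x_t = rho * x_{t-1} + sqrt(1 - rho^2) * e_t
    with standard normal e_t, and Laplace label noise of scale `noise_scale`.
    """
    rng = random.Random(seed)

    def laplace(scale):
        # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1).
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

    w = [laplace(1.0) for _ in range(d)]
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # start at the stationary law
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(sum(wj * xj for wj, xj in zip(w, x)) + laplace(noise_scale))
        # The next feature vector depends on the current one (non-i.i.d. prompt).
        x = [rho * xj + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for xj in x]
    return xs, ys, w


def ridge_fit(xs, ys, lam=0.0):
    """Solve (X^T X + lam * I) w = X^T y by Gaussian elimination.

    lam = 0 gives ordinary least squares (requires X^T X to be invertible).
    """
    d = len(xs[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    a = [
        [sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0) for j in range(d)]
        + [sum(x[i] * y for x, y in zip(xs, ys))]
        for i in range(d)
    ]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, d):
            f = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= f * a[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (a[i][d] - sum(a[i][j] * w[j] for j in range(i + 1, d))) / a[i][i]
    return w


if __name__ == "__main__":
    xs, ys, w_true = sample_prompt(n=20, d=3, rho=0.8, noise_scale=0.1, seed=0)
    # Fit on the first 19 pairs and predict the label of the last input.
    w_hat = ridge_fit(xs[:-1], ys[:-1], lam=0.1)
    y_pred = sum(wj * xj for wj, xj in zip(w_hat, xs[-1]))
    print("squared error on the query point:", (y_pred - ys[-1]) ** 2)
```

Sweeping `rho`, or swapping the coefficient and noise samplers, gives a rough analogue of the distribution shifts listed in the captions. A Transformer's in-context predictions on the same prompts would be compared against baselines like `ridge_fit`.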