Transformers Learn Robust In-Context Regression under Distributional Uncertainty

Hoang T. H. Cao; Hai D. V. Trinh; Tho Quan; Lan V. Truong

Abstract

Recent work has shown that Transformers can perform in-context learning for linear regression under restrictive assumptions, including i.i.d. data, Gaussian noise, and Gaussian regression coefficients. However, real-world data often violate these assumptions: the distributions of inputs, noise, and coefficients are typically unknown and non-Gaussian, and the data may exhibit dependence across the prompt. This raises a fundamental question: can Transformers learn effectively in-context under realistic distributional uncertainty? We study in-context learning for noisy linear regression under a broad range of distributional shifts, including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. We compare Transformers against classical baselines that are optimal or suboptimal under the corresponding maximum-likelihood criteria. Across all settings, Transformers consistently match or outperform these baselines, demonstrating robust in-context adaptation beyond classical estimators.
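To make the setting in the abstract concrete, the following is a minimal sketch of how one regression prompt with non-Gaussian coefficients and heavy-tailed noise could be sampled, together with an ordinary least squares baseline. The function names, the dimension, the prompt length, the Laplace scale, and the `noise_scale` parameter are illustrative assumptions; they are not taken from the paper's experimental configuration.

```python
import numpy as np


def sample_prompt(n_points=40, dim=5, noise="student_t", noise_scale=0.1, seed=0):
    """Draw one prompt (X, y) and its true coefficients w.

    All sizes and scales here are hypothetical defaults, not the paper's.
    """
    rng = np.random.default_rng(seed)
    w = rng.laplace(loc=0.0, scale=1.0, size=dim)   # non-Gaussian coefficients
    X = rng.standard_normal((n_points, dim))         # i.i.d. Gaussian features
    if noise == "student_t":
        eps = rng.standard_t(df=2, size=n_points)    # heavy-tailed noise (nu = 2)
    elif noise == "gaussian":
        eps = rng.standard_normal(n_points)
    else:
        raise ValueError(f"unknown noise type: {noise}")
    y = X @ w + noise_scale * eps                    # noisy linear regression
    return X, y, w


def ols_predict(X_ctx, y_ctx, x_query):
    """Least-squares baseline: fit on the context, predict the query point."""
    w_hat, *_ = np.linalg.lstsq(X_ctx, y_ctx, rcond=None)
    return x_query @ w_hat


# Use all but the last pair as context and the last input as the query.
X, y, w = sample_prompt()
prediction = ols_predict(X[:-1], y[:-1], X[-1])
squared_error = (prediction - y[-1]) ** 2
```

In this sketch an in-context learner would receive the same context pairs and query as `ols_predict`, so the two can be compared on the same squared (or absolute) error.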

Paper Structure (64 sections, 38 equations, 10 figures)

Introduction
Motivation
Related Work
In-Context Learning.
Beyond Gaussian Distributions.
Contributions
Systematic study of in-context learning under distributional uncertainty.
Evaluation against ML-optimal and suboptimal classical baselines.
Evidence of robust in-context adaptation beyond ordinary least squares.
Model Settings
Linear Regression Task
Coefficients ($\mathbf{w}$), Features ($\mathbf{X}$), and Noise ($\varepsilon$)
Coefficient Distributions ($\mathbf{w}$)
Laplace distribution.
Exponential distribution.
...and 49 more sections

Figures (10)

Figure 1: In-context prediction error under different coefficient priors in the noiseless setting. All methods are trained with $\ell_2$ loss. Top row: exponential (left) and Laplace (right) coefficient distributions. Bottom row: coefficients sampled uniformly from the unit hypersphere. We compare Transformer-based in-context learning against OLS, Ridge, and $\ell_1$-based solvers (LP and ADMM). Errors are reported as normalized excess loss relative to a curriculum-dependent baseline.
Figure 2: Effect of feature distribution shifts on in-context learning. Gamma features induce skewed, non-negative marginals while preserving independence across context steps. VAR(1) features introduce temporal dependence across the prompt through an autoregressive structure.
Figure 3: Robustness of in-context learning under non-Gaussian and heavy-tailed noise distributions. All models in this figure are trained and evaluated using $\ell_1$ loss. We compare Transformer-based in-context learning with classical estimators, including least squares, Ridge regression, and $\ell_1$ solvers (LP and ADMM). For Bernoulli, exponential, Gamma, and Poisson noise, the $\ell_1$ objective corresponds to the maximum-likelihood estimator. The Student-$t$ case ($\nu = 2$) is shown separately, as it admits no finite-variance likelihood and falls outside the classical ML framework.
Figure 4: Extension of Figure 2 (Gamma-distributed features). In-context learning performance across different Gamma shape and rate parameters.
Figure 5: Extension of Figure 2 (VAR(1) features). Effect of increasing temporal correlation strength across the prompt.
...and 5 more figures
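The two feature shifts named in the Figure 2 caption can be illustrated with a short sketch. It assumes a diagonal VAR(1) transition matrix `rho * I` with innovations scaled to keep unit stationary variance, and a Gamma parameterization by shape and rate. These choices, the function names, and the default parameter values are assumptions for illustration; the paper's actual parameterization is not given in this excerpt.

```python
import numpy as np


def gamma_features(n_points, dim, shape=2.0, rate=1.0, rng=None):
    """Skewed, non-negative features, independent across context steps."""
    rng = np.random.default_rng() if rng is None else rng
    # NumPy parameterizes the Gamma distribution by scale = 1 / rate.
    return rng.gamma(shape, 1.0 / rate, size=(n_points, dim))


def var1_features(n_points, dim, rho=0.5, rng=None):
    """Features with temporal dependence: x_t = rho * x_{t-1} + innovation.

    Assumes |rho| < 1 and a diagonal transition matrix rho * I (hypothetical).
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.zeros((n_points, dim))
    X[0] = rng.standard_normal(dim)
    innovation_std = np.sqrt(1.0 - rho ** 2)  # keeps each coordinate at unit variance
    for t in range(1, n_points):
        X[t] = rho * X[t - 1] + innovation_std * rng.standard_normal(dim)
    return X


rng = np.random.default_rng(0)
X_gamma = gamma_features(40, 5, rng=rng)
X_var = var1_features(40, 5, rho=0.9, rng=rng)
```

Raising `rho` toward 1 strengthens the correlation between consecutive context steps, which is the kind of dependence the Figure 5 caption describes as increasing temporal correlation strength.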
