Analyzing limits for in-context learning

Omar Naim, Jerome Bolte, Nicholas Asher

TL;DR

This work interrogates the claim that transformer-based in-context learning implements classical learning algorithms with robust out-of-distribution generalization. Through a controlled polynomial-function task and formal analysis, it shows that ICL performance is constrained by architecture—particularly attention and Layer Normalization—leading to interpolation within training distributions and poor extrapolation. Empirically, two-layer attention-only models suffice for in-distribution learning, but generalization deteriorates under distribution shifts and across degrees of polynomials; theoretically, the authors prove that attention-only transformers cannot learn linear functions on significantly out-of-distribution inputs. These findings challenge algorithmic interpretations of ICL and point to fundamental architectural limits, with implications for how we train and design contextual learning systems.

Abstract

Our paper challenges claims from prior research that transformer-based models, when learning in context, implicitly implement standard learning algorithms. We present empirical evidence inconsistent with this view and provide a mathematical analysis demonstrating that transformers cannot achieve general predictive accuracy due to inherent architectural limitations.

Paper Structure

This paper contains 20 sections, 7 theorems, 30 equations, 4 figures, 5 tables.

Key Result

Lemma 1

The multihead attention function tends to a single linear function $x \mapsto ax + b$ as $x \rightarrow \infty$.
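
One way such limiting behavior can arise is sketched below in a deliberately small setting. This is not the authors' construction: it assumes a single softmax head, scalar tokens, and a query token that attends to the context and to itself. The function `toy_attention`, the weights `wq`, `wk`, `wv`, `bv`, and the context values are all hypothetical choices made for illustration. Under these assumptions the self-attention score grows like $x^2$ while the other scores grow only like $x$, so the softmax concentrates on the query token and the output approaches $w_v x + b_v$.

```python
import numpy as np

def toy_attention(x, context, wq=1.0, wk=1.0, wv=2.0, bv=0.5):
    """Single-head softmax attention on scalar tokens (illustrative toy only)."""
    # The query token x attends to the context tokens and to itself.
    tokens = np.append(np.asarray(context, dtype=float), x)
    # Scores are query * key; the self-score is quadratic in x.
    scores = (wq * x) * (wk * tokens)
    # Numerically stable softmax.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Values are an affine map of each token.
    values = wv * tokens + bv
    return float(weights @ values)

context = [-0.8, -0.2, 0.3, 0.9]  # hypothetical in-range context tokens
for x in (0.5, 2.0, 5.0, 20.0):
    out = toy_attention(x, context)
    gap = out - (2.0 * x + 0.5)  # distance to the limiting map wv * x + bv
    print(f"x={x:5.1f}  attention={out:9.4f}  gap={gap:.2e}")
```

For small $x$ the output mixes several context values and is visibly not affine in $x$; the gap to the limiting map shrinks rapidly as $x$ grows.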

Figures (4)

  • Figure 1: Evolution of error rates for various 12L8AH $d_{emb} = 256$ models with $D_{\cal F}, D_{\cal I}, D^t_{\cal I} \sim {\cal U}(-1,1)$ and $D^t_{\cal F} \sim {\cal U}(-\sigma, \sigma)$ for various $\sigma$, each trained from scratch on a different degree. E.g., $M_n$ is a model trained on degree $n$ only. The black line is a predictor that yields $f(x_n) = 0$ for all $f$ and all $x_n$. The dark red line LS represents a perfect estimator with our clean input data.
  • Figure 2: The first row of graphs shows error rates for $M_{135}$, a full 12L8AH transformer model trained on degrees 1, 3, and 5 with values and inputs sampled from $\mathcal{U}(-1,1)$; $M_n$ is the same model trained only on degree $n \in \{1,\ldots,5\}$; and $M_{135AL}$ is a 12L8AH model with only attention layers and no MLP layers. All models were tested on polynomials of degrees 1--5. The second row shows similar results for models trained by curriculum on degrees 1, 2, and 3.
  • Figure 3: Representation of the different types of training, based on the polynomials of degree 1: (1,x). (Left) $\mathcal{T}$, training on a cloud of points, (middle) $\mathcal{T}_1$ on the two principal directions of the basis and (right) $\mathcal{T}_2$ training on several directions. The rectangle represents the set of polynomials of degree 1 taking the weights in $\mathcal{U}(-1,1)$.
  • Figure 4: Emergence of boundary values in models trained and evaluated on polynomial functions. Results are presented for degree 1 (left), degree 2 (middle), and degree 3 (right).
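
The two reference lines in Figure 1 (the zero predictor and the LS estimator) can be approximated with a short NumPy sketch. Several details here are assumptions rather than the paper's protocol: the error is taken to be mean squared error on the final point of each prompt, LS is read as an ordinary least-squares polynomial fit of the known degree, and the names `sample_task` and `baseline_errors`, the context length, and the trial count are hypothetical. Coefficients are drawn from $\mathcal{U}(-\sigma, \sigma)$ and inputs from $\mathcal{U}(-1, 1)$, following the caption.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def sample_task(degree, n_points, coef_scale=1.0, input_scale=1.0, rng=None):
    """Draw one polynomial and n_points noise-free evaluations of it."""
    rng = np.random.default_rng() if rng is None else rng
    # Coefficients ordered from lowest to highest degree.
    coefs = rng.uniform(-coef_scale, coef_scale, size=degree + 1)
    xs = rng.uniform(-input_scale, input_scale, size=n_points)
    return xs, P.polyval(xs, coefs)

def baseline_errors(degree, n_context, sigma, n_trials=200, seed=0):
    """Mean squared error of the zero predictor and of a least-squares fit."""
    rng = np.random.default_rng(seed)
    zero_err, ls_err = [], []
    for _ in range(n_trials):
        xs, ys = sample_task(degree, n_context + 1, coef_scale=sigma, rng=rng)
        ctx_x, ctx_y, query_x, query_y = xs[:-1], ys[:-1], xs[-1], ys[-1]
        # Least-squares fit on the context points, then predict the query.
        fit = P.polyfit(ctx_x, ctx_y, degree)
        pred = P.polyval(query_x, fit)
        zero_err.append(query_y ** 2)          # predictor that always outputs 0
        ls_err.append((pred - query_y) ** 2)   # least-squares predictor
    return float(np.mean(zero_err)), float(np.mean(ls_err))

for sigma in (1.0, 2.0, 5.0):
    zero, ls = baseline_errors(degree=3, n_context=10, sigma=sigma)
    print(f"sigma={sigma}: zero predictor MSE={zero:.3f}, LS MSE={ls:.2e}")
```

On noise-free data with more context points than coefficients, the least-squares error stays near machine precision for every $\sigma$, while the zero predictor's error grows with the coefficient range. Under these assumptions, the two baselines bracket the range in which a trained model's error would fall if it does no worse than always predicting zero.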

Theorems & Definitions (7)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Corollary 1
  • Proposition 1
  • Lemma 3
  • Lemma 4