Asymptotic theory of in-context learning by linear attention

Yue M. Lu, Mary I. Letey, Jacob A. Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan

TL;DR

This work addresses the question of when in-context learning (ICL) emerges in Transformer-like architectures by analyzing an exactly solvable linear-attention model for linear regression. It develops a sharp joint-asymptotic theory in a scaling limit with $\frac{\ell}{d}=\alpha$, $\frac{k}{d}=\kappa$, and $\frac{n}{d^2}=\tau$, deriving deterministic generalization-error curves $e^\mathrm{ICL}(\tau,\alpha,\kappa,\rho,\lambda)$ and $e^\mathrm{IDG}(\tau,\alpha,\kappa,\rho,\lambda)$ that reveal a double-descent phenomenon and a phase transition as task diversity grows. The results show memorization vs. genuine ICL behavior separated by a diversity threshold near $\kappa=1$, and demonstrate non-monotone dependence of errors on context length $\alpha$, with sharp predictions confirmed by experiments on both linear and nonlinear Transformers. The findings provide mechanistic insight into how pretraining data size, context length, and task diversity control ICL and its generalization, and they suggest that similar scaling laws extend to full Transformer architectures. Overall, the work connects random-matrix theory to practical questions about pretraining and context-based learning, offering principled guidance for designing pretraining curricula to induce robust ICL. The conclusions indicate that in-context learning can arise without full Transformer complexity, while also showing when such capabilities robustly generalize to new tasks in a principled, scalable regime.

Abstract

Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
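
As a concrete reading of the scaling regime described above, the following minimal Python sketch converts the three ratios into finite problem sizes. It assumes the parametrization $\ell = \alpha d$, $k = \kappa d$, and $n = \tau d^2$; the function name `scaling_sizes`, the rounding rule, and the convention of returning `None` for $\kappa = \infty$ are illustrative choices, not something taken from the paper.

```python
import math

def scaling_sizes(d, alpha, kappa, tau):
    """Return (ell, k, n) for token dimension d under the assumed scaling.

    ell = alpha * d    (context length, proportional to d)
    k   = kappa * d    (pretraining task diversity, proportional to d)
    n   = tau * d**2   (number of pretraining examples, quadratic in d)
    kappa = math.inf is mapped to k = None (no finite task set).
    """
    ell = round(alpha * d)
    k = None if math.isinf(kappa) else round(kappa * d)
    n = round(tau * d ** 2)
    return ell, k, n

# Hypothetical example values; d = 100 matches the dimension quoted in the figure captions.
sizes_finite = scaling_sizes(d=100, alpha=10, kappa=0.5, tau=0.5)
sizes_infinite = scaling_sizes(d=100, alpha=10, kappa=math.inf, tau=0.5)
```

The point of the sketch is only that doubling $d$ at fixed $(\alpha, \kappa, \tau)$ doubles the context length and the number of tasks but quadruples the number of pretraining examples.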

Paper Structure (38 sections, 5 theorems, 243 equations, 6 figures)

Key Result

Lemma 1

Let the task vector $w \in \mathcal{S}^{d-1}(\sqrt{d})$ be fixed. We have Moreover, and
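
To make the constraint $w \in \mathcal{S}^{d-1}(\sqrt{d})$ concrete, here is a small NumPy sketch that draws one vector on the sphere of radius $\sqrt{d}$ by normalizing a Gaussian sample. Drawing $w$ uniformly at random is an assumption made for illustration (the lemma only requires $w$ to be fixed), and the helper name `sample_task_vector` is hypothetical.

```python
import numpy as np

def sample_task_vector(d, rng):
    """Draw a vector uniformly from the sphere of radius sqrt(d) in R^d.

    Normalizing an isotropic Gaussian gives a uniform direction;
    rescaling by sqrt(d) places it on the sphere S^{d-1}(sqrt(d)).
    """
    g = rng.standard_normal(d)
    return np.sqrt(d) * g / np.linalg.norm(g)

rng = np.random.default_rng(0)
d = 100
w = sample_task_vector(d, rng)
squared_norm = float(w @ w)  # equals d up to floating-point error
```

With this normalization, $\|w\|^2 = d$, so each coordinate of $w$ has typical magnitude of order one regardless of $d$.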

Figures (6)

  • Figure 1: ICL performance as a function of $\tau$: theory (solid lines) vs simulations (dots). Plots of (a.) $e^\mathrm{IDG}_\mathrm{ridgeless}(\tau,\alpha,\kappa,\rho)$, (b.) $e^\mathrm{ICL}_\mathrm{ridgeless}(\tau,\alpha,\kappa,\rho)$, and (c.) $e^\mathrm{ICL}(\tau,\alpha,\kappa,\rho,\lambda)$ against $\tau$. Simulated errors are calculated by evaluating the corresponding test error at the corresponding optimized $\Gamma^*$. Parameters: $d=100$, $\rho = 0.01$ for all; (a.), (b.) $\kappa=0.5$; (c.) $\alpha = 10, \kappa = \infty$. Averages and standard deviations are computed over 10 runs.
  • Figure 2: Error curves as functions of $\alpha$: theory (solid lines) vs simulations (dots). Plots of (a.) $e^\mathrm{IDG}_\mathrm{ridgeless}$ and (b.), (c.) $e^\mathrm{ICL}_\mathrm{ridgeless}$ against $\alpha$. Hollow markers, when plotted, indicate the value $\alpha^\ast$ that minimizes the error, if it exists. Parameters: $d = 100$; (a.) $\tau=0.5,\rho=0.5$; (b.) $\tau=0.5,\rho=0.1$; (c.) $\tau = 20, \rho=0.5$. Averages and standard deviations are computed over 20 to 100 runs.
  • Figure 3: Phase diagram of non-monotonicity in $\alpha$ of ridgeless IDG and ICL error, for fixed $\tau$ against task diversity $\kappa$ and label noise $\rho$. The colormap corresponds to the value $\alpha^\ast$ that minimizes error in $\alpha$, if it exists, at that particular $\kappa, \rho$ pair; grey is plotted if the error curve is monotonic at that $\kappa, \rho$ pair. Panels (a.), (b.), (c.) in this plot correspond to the respective setup in Figure 2, and points A-E correspond to the respective curve in Figure 2. The dashed vertical lines are plotted at $\kappa = \min(\tau,1)$. We know from Eq. \ref{eq:lambdaalphalimit_tsmall} and surrounding discussion that ICL error diverges in $\alpha$ for all $\kappa,\rho$ to the left of this line, and IDG error diverges on this line for $\tau < 1$.
  • Figure 4: Plots of $g_\text{task}$ against $\kappa$ for the linear transformer model vs the dMMSE estimator. (a.) shows plots on a log-log scale, highlighting the substantial difference in their rates of convergence towards 0. Dashed and solid lines are theory predictions; dots and squares are numerical simulations. (b.) shows plots on a log-linear scale with the addition of the green $\alpha\to\infty$ curve given by Eq. \ref{eq:proportionallimit}, demonstrating the phase transition at $\kappa = 1$. Parameters: linear transformer: $d = 100$ to $140$, $\tau = 0.2\alpha$ throughout, simulations computed over 20 runs; dMMSE: $d=100$, simulations computed over 1000 to 5000 runs.
  • Figure 5: Experimental verification, in full linear attention (a.) (full $K,Q,V$ matrices) and nonlinear models (b.) (one block of softmax attention only) and (c.) (two blocks of softmax attention and a 1-layer MLP), of both the scaling definition for $\tau$ and the double-descent behavior in $n$. (a.), (b.), (c.) show error curves against $\tau$ for various architectures, consistent across token dimension $d = 20,40,80$. Some deviations arise near double descent in the linear model (a.), possibly due to parameter inversion or instabilities, and for large $\tau$ in the deep nonlinear model (c.), possibly due to the training procedure. Double-descent phenomena are confirmed: increasing $n$ will increase error until an interpolation threshold is reached. Colored dashed lines indicate the experimental interpolation threshold for that architecture and $d$ configuration. (d.) shows that the interpolation threshold occurs at $n$ proportional to $d^2$, as predicted by the linear theory. Dots are experimental interpolation thresholds for various architectures, and dashed lines are best-fit curves obtained by fitting $\log(n) = a\log(d) +b$, each with $a\approx 2$ (explicitly, $a_{\text{full linear}} = 1.87$, $a_{\text{softmax}} = 1.66$, $a_{\text{2 blocks}} = 2.13$, $a_{\text{3 blocks}} = 2.08$). The interpolation threshold was computed empirically by searching for the location in $\tau$ of a sharp increase in the value and variance of the training error at a fixed number of gradient steps. Parameters: $\alpha = 1, \kappa = \infty, \rho = 0.01$. For (a.), (b.), and (c.): the variance shown comes from models trained over different samples of pretraining data; lines show averages over 10 runs and shaded regions show standard deviation.
  • ...and 1 more figure
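
The Figure 5 caption describes estimating the scaling exponent of the interpolation threshold by fitting $\log(n) = a\log(d) + b$. The sketch below shows one way such a fit could be done with an ordinary least-squares line in log-log space. The threshold values are synthetic placeholders generated as $n = 1.5\,d^2$, not the paper's measurements, and the helper name `fit_threshold_exponent` is hypothetical.

```python
import numpy as np

def fit_threshold_exponent(d_values, n_thresholds):
    """Least-squares fit of log(n) = a * log(d) + b; returns (a, b)."""
    log_d = np.log(np.asarray(d_values, dtype=float))
    log_n = np.log(np.asarray(n_thresholds, dtype=float))
    # polyfit with deg=1 returns coefficients highest power first: slope, then intercept.
    a, b = np.polyfit(log_d, log_n, deg=1)
    return float(a), float(b)

# Synthetic thresholds (assumption): exactly quadratic in d, so the fit should give a = 2.
d_values = [20, 40, 80]
n_thresholds = [1.5 * d ** 2 for d in d_values]
a, b = fit_threshold_exponent(d_values, n_thresholds)
```

On real threshold estimates the fitted slope would deviate from 2, as the exponents quoted in the caption (between 1.66 and 2.13) illustrate.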

Theorems & Definitions (19)

  • Lemma 1: Conditional moments
  • proof
  • Proposition 1: Generalization error
  • Remark 1
  • proof
  • Corollary 1
  • proof
  • Remark 2
  • Lemma 2
  • proof
  • ...and 9 more