Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers

Nischal Mainali, Lucas Teixeira

TL;DR

This work derives exact, non-asymptotic SGD dynamics for a linear transformer performing in-context linear regression, revealing a clear separation of learning timescales along data eigenmodes, set by the input covariance spectrum. Learning proceeds along mode-specific, nonlinear trajectories constrained by a conserved quantity, and at convergence the learned computation effectively preconditions the inputs to recover the target linear map. The authors extend insights from solvable deep linear networks to propose macroscopic diagnostics (spectral rank dynamics, subspace stability, curvature-based loss analysis) and demonstrate qualitative parallels in non-linear, multi-layer transformers, including sudden ICL emergence and grokking. The results offer a principled analytic framework for interpreting transformer training, with potential applications to interpretability and monitoring, and possible extensions to nonlinear attention architectures and more complex tasks.

Abstract

Transformer models exhibit remarkable in-context learning (ICL), adapting to novel tasks from examples within their context, yet the underlying mechanisms remain largely mysterious. Here, we provide an exact analytical characterization of ICL emergence by deriving the closed-form stochastic gradient descent (SGD) dynamics for a simplified linear transformer performing regression tasks. Our analysis reveals key properties: (1) a natural separation of timescales directly governed by the input data's covariance structure, leading to staged learning; (2) an exact description of how ICL develops, including fixed points corresponding to learned algorithms and conservation laws constraining the dynamics; and (3) surprisingly nonlinear learning behavior despite the model's linearity. We hypothesize this phenomenology extends to non-linear models. To test this, we introduce theory-inspired macroscopic measures (spectral rank dynamics, subspace stability) and use them to provide mechanistic explanations for (1) the sudden emergence of ICL in attention-only networks and (2) delayed generalization (grokking) in modular arithmetic models. Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.
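The staged learning and conservation law described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's derived dynamics: it assumes each covariance eigenmode is governed by a scalar product of two parameters with per-mode loss 0.5 * lam * (1 - p * q)^2, in the style of solvable deep linear networks. The function name `simulate_modes`, the eigenvalues, the learning rate, and the initial values are all hypothetical choices made for illustration.

```python
def simulate_modes(eigenvalues, lr=1e-3, steps=40_000, p0=0.02, q0=0.01):
    """Gradient descent on the ASSUMED toy loss
    0.5 * sum_k lam_k * (1 - p_k * q_k)**2, one (p_k, q_k) pair per eigenmode.
    This is an illustrative stand-in, not the model analysed in the paper."""
    p = [p0] * len(eigenvalues)
    q = [q0] * len(eigenvalues)
    half_step = [None] * len(eigenvalues)  # first step at which p_k * q_k >= 0.5

    # Scalar analogue of a norm-difference conserved quantity.
    c_start = sum(pk * pk - qk * qk for pk, qk in zip(p, q))

    for step in range(1, steps + 1):
        for k, lam in enumerate(eigenvalues):
            err = lam * (1.0 - p[k] * q[k])
            # Simultaneous update: the right-hand side uses the old p_k and q_k.
            p[k], q[k] = p[k] + lr * err * q[k], q[k] + lr * err * p[k]
            if half_step[k] is None and p[k] * q[k] >= 0.5:
                half_step[k] = step

    c_end = sum(pk * pk - qk * qk for pk, qk in zip(p, q))
    products = [pk * qk for pk, qk in zip(p, q)]
    return products, half_step, c_start, c_end


# Hypothetical spectrum: three modes with well-separated eigenvalues.
products, half_step, c_start, c_end = simulate_modes([4.0, 1.0, 0.25])
print("final products p*q per mode:", products)
print("step at which each mode is half learned:", half_step)
print("conserved quantity at start and end:", c_start, c_end)
```

In this toy setting, modes with larger eigenvalues leave their plateau earlier, each product follows a sigmoidal rather than exponential trajectory even though the model is linear in each factor, and the difference of squared parameter norms stays nearly constant. Under exact gradient flow it would be exactly constant; with a finite step size it drifts slightly.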

Paper Structure

This paper contains 38 sections, 109 equations, 7 figures.

Figures (7)

  • Figure 1: Fixed point parameters: theory vs. simulation. (A) Learned parameters ($p_2 q_1$) in batch-averaged simulation, diagonalized in the input covariance basis. (B) Theoretical prediction for the fixed point $p_2 q_1 (\infty)$, showing precise quantitative agreement.
  • Figure 2: Parameter dynamics: theory vs. simulation. (A) Empirical, batch-averaged evolution of the diagonal elements of the learned parameter $\bar{a}(t) = \bar{p}_2(t) \bar{q}_1(t)$ compared with theoretical predictions (Eq. \ref{eq:a_alpha_solution_main}), showing excellent agreement. (B) Mirrored evolution of the two terms in the conserved quantity, confirming the stability of $\mathcal{C} = \lVert \bar{p}_2 \rVert_F^2 - \lVert \bar{q}_1 \rVert_F^2$.
  • Figure 3: (A) Loss dynamics: theory vs. simulation. The empirical training loss curve closely matches the analytical prediction (Eq. \ref{eq:loss_dynamics_main}), capturing characteristic plateaus and subsequent rapid decreases (cliffs) corresponding to the sequential learning of different spectral modes. (B) Training loss dynamics of a transformer model trained on a modular arithmetic task are qualitatively similar to the training dynamics of the linear transformer in (A) and exhibit the delayed-generalization phenomenon known as "grokking".
  • Figure 4: Separation of timescales can be identified in (A) the curvature of the loss autocorrelation function, (B) the curvature of the parameter norm dynamics, and (C) the marginalized effective rank measure.
  • Figure 5: Subspace Stabilization in Attention-Only Models (1-4 Layers). Spectral alignment metric (Subspace Distance, Eq. \ref{eq:subspace_distance_main}), averaged across OV and QK matrices, vs. training time. Lower values indicate subspace stability. The highlighted region marks ICL emergence. Subspace directions converge relatively early, preceding or coinciding with ICL emergence, especially in deeper models.
  • ...and 2 more figures