Dynamics of Transient Structure in In-Context Linear Regression Transformers

Liam Carroll, Jesse Hoogland, Matthew Farrugia-Roberts, Daniel Murfet

TL;DR

The paper investigates transient structure in transformers trained on in-context linear regression across varying task diversity $M$, showing a progression from ridge-like generalization to dMMSE memorization. It combines joint trajectory PCA to reveal a generalization-memorization axis in function space with a Bayesian-inspired loss/complexity tradeoff based on the local learning coefficient (LLC) to explain the dynamics. The authors replicate the transient ridge phenomenon, identify a task-diversity threshold, and provide empirical LLC-based validation linking development to an evolving balance between accuracy and model complexity. These results offer a principled lens—dynamic internal model selection—for understanding how internal representations in deep networks develop and reorganize under changing data distributions, with implications for generalization and robustness.

Abstract

Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
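
The two reference predictors named above can be made concrete with a short NumPy sketch. It is an illustration under stated assumptions, not the authors' code: it assumes the standard Bayesian reading in which ridge is the posterior mean under a Gaussian $\mathcal{N}(0, I)$ task prior, and dMMSE$_M$ is the posterior mean under a uniform prior over the $M$ training tasks. The function names, the dimensions, and the noise variance are all hypothetical choices made for the example.

```python
import numpy as np

def ridge_predictor(X, y, noise_var):
    """Posterior-mean weights, assuming a N(0, I) prior over tasks."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(d), X.T @ y)

def dmmse_predictor(X, y, tasks, noise_var):
    """Posterior-mean weights, assuming a uniform prior over the rows of `tasks`."""
    residuals = y[None, :] - tasks @ X.T              # shape (M, k)
    log_w = -0.5 * np.sum(residuals**2, axis=1) / noise_var
    log_w -= log_w.max()                              # stabilise the softmax
    w = np.exp(log_w)
    w /= w.sum()
    return w @ tasks                                  # convex combination of tasks

# Illustrative values only (not taken from the paper).
rng = np.random.default_rng(0)
d, k, M, noise_var = 4, 16, 8, 0.125
tasks = rng.standard_normal((M, d))                   # the M training tasks
X = rng.standard_normal((k, d))                       # in-context inputs
y = X @ tasks[3] + np.sqrt(noise_var) * rng.standard_normal(k)

w_ridge = ridge_predictor(X, y, noise_var)            # "general" solution
w_dmmse = dmmse_predictor(X, y, tasks, noise_var)     # "specialized" solution
```

In this reading, the dMMSE$_M$ estimate can only ever return a mixture of the $M$ training tasks, whereas ridge can fit any task vector. That is the sense in which the first is specialized and the second is general.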

Paper Structure

This paper contains 74 sections, 27 equations, 25 figures, 2 tables.

Figures (25)

  • Figure 1: Behavioral dynamics of the transient ridge phenomenon. (Top left): OOD loss over training on sequences sampled with a Gaussian task distribution for task diversities $M \in \mathcal{M}$. For intermediate $M$ we see non-monotonicity caused by the transient ridge phenomenon, or "forgetting" as observed by Panwar et al. (2024). We define $t^{\operatorname{crit}}_{M}$ as the step at which the OOD loss is minimized for $M$ (\ref{appendix:tcrit}). We mark this step with a circle in the other plots. (Right): We project each transformer's trajectory $\{f(\cdot, w_t^M)\}_{t \in \mathcal{C}}$ to a curve $\gamma_M(t)$ in the essential subspace computed by joint trajectory PCA. We project dMMSE$_{M}$ (diamonds) and ridge (square) into the same subspace. For intermediate task diversity $M$, the development is deflected towards ridge on its way towards dMMSE$_{M}$. (Bottom left): In-distribution function-space distances $\Delta(\cdot, \text{dMMSE$_{M}$/Ridge})$ clarify which fully-trained transformers (stars) approximate dMMSE$_{M}$, and which transformers approximate ridge at $t^{\operatorname{crit}}_{M}$ (circles). (Note): Loss and PC curves are lightly smoothed; see \ref{section:gaussian_smoothing} for raw data.
  • Figure 2: Transient ridge in the loss landscape. Conceptual illustration of transient ridge arising as the result of an evolving tradeoff between loss and LLC (complexity, illustrated as sharpness). As $M$ increases, we expect the loss gap between dMMSE$_{M}$ and ridge to shrink and the LLC of dMMSE$_{M}$ to grow, creating transience for intermediate $M$.
  • Figure 3: Loss and LLC estimates match predictions. (Top): Estimated loss with respect to data distribution $q_M(S)$ for the idealized predictors and fully-trained transformers. The gap between dMMSE$_{M}$ and ridge decreases with $M$, and trained transformers approximate this loss on either side of the task diversity threshold (diamonds for dMMSE$_{M}$, squares for ridge). (Bottom): Estimated LLC for fully-trained transformers. Large-$M$ LLCs converge to the LLC of ridge (dashed line). Small-$M$ LLCs, representing the LLC of dMMSE$_{M}$, cross this line as $M$ increases.
  • Figure 2.1: Model evaluation and table of critical times for the primary architecture. Each dataset $\mathcal{D}^{(M)}$ is fixed for all timesteps, with $B=512$ samples in each.
  • Figure 3.1: Alignment of PC trajectories with loss dynamics. (Top): First two principal component trajectories $\gamma_M(t) = (\gamma_M^1(t), \gamma_M^2(t))$ (left and right, respectively) over training time $t$, corresponding to the data of \ref{fig:ood-and-pca}. (Bottom): Corresponding loss curves $\ell_B^{1}$ (left) and $\ell_B^{\infty}$ (right). There is a striking correlation between each PC1 curve and its corresponding root-distribution loss $\ell_B^{1}$, which we explain in \ref{appendix:PC1-dev-time}. Transient ridge is seen in the PC2 curves $\gamma_M^2(t)$, whose critical points coincide with the minima of each $\ell_B^{\infty}$ curve, which defines $t^{\operatorname{crit}}_{M}$ as in \ref{appendix:tcrit}.
  • ...and 20 more figures
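
Two quantities in the captions above lend themselves to a small sketch: the critical step $t^{\operatorname{crit}}_{M}$ (the checkpoint where the OOD loss is smallest) and the joint trajectory PCA that yields the curves $\gamma_M(t)$. The code below is a minimal illustration on synthetic data, not the authors' pipeline. It assumes each trajectory is stored as an array of model outputs on one shared evaluation batch, and that the "essential subspace" is approximated by the top two principal components. All names and values are hypothetical.

```python
import numpy as np

def critical_step(steps, ood_loss):
    """Checkpoint step at which the OOD loss curve is smallest."""
    return steps[int(np.argmin(ood_loss))]

def joint_trajectory_pca(trajectories, n_components=2):
    """Fit one PCA to all trajectories jointly.

    trajectories: dict mapping M -> array of shape (T, n_outputs), the
    model's outputs on a shared batch at each of T checkpoints.
    Returns the projected curves and a `project` function that maps any
    other output vector (e.g. a reference predictor) into the same subspace.
    """
    keys = sorted(trajectories)
    stacked = np.concatenate([trajectories[m] for m in keys], axis=0)
    mean = stacked.mean(axis=0)
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    components = vt[:n_components]                    # (n_components, n_outputs)

    def project(outputs):
        return (np.atleast_2d(outputs) - mean) @ components.T

    curves = {m: project(trajectories[m]) for m in keys}
    return curves, project

# Synthetic stand-ins for real checkpoints and losses.
steps = np.array([0, 100, 200, 300, 400])
ood_loss = np.array([3.0, 1.5, 0.9, 1.2, 1.6])       # non-monotonic, as for intermediate M
t_crit = critical_step(steps, ood_loss)

rng = np.random.default_rng(0)
trajectories = {4: rng.standard_normal((5, 32)), 64: rng.standard_normal((5, 32))}
curves, project = joint_trajectory_pca(trajectories)
reference_point = project(rng.standard_normal(32))    # e.g. a reference predictor's outputs
```

Fitting a single PCA to the stacked trajectories, rather than one per model, is what places every curve and every reference predictor in a common coordinate system, so that a deflection towards one reference point is comparable across values of $M$.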