Loss Landscape Degeneracy and Stagewise Development in Transformers

Jesse Hoogland; George Wang; Matthew Farrugia-Roberts; Liam Carroll; Susan Wei; Daniel Murfet

TL;DR

The paper examines how degeneracy in the local loss landscape, quantified by the local learning coefficient (LLC) from singular learning theory, tracks stagewise development in transformers. By monitoring the LLC during training of a language-model transformer and an in-context linear regression transformer, the authors identify critical points (plateaus) in the LLC curve that delineate developmental stages. Most of these stages coincide with interpretable shifts in internal structure (e.g., bigram and n-gram learning, use of positional information, induction circuit formation) and in input/output behavior (including in-context learning). They contrast LLC-based stage detection with Hessian-based metrics, reporting that the LLC captures multiple stage boundaries that curvature metrics miss, and they provide methodological details on SGLD-based LLC estimation and local Bayesian free-energy reasoning. The findings suggest degeneracy as a unifying, setting-agnostic lens on how modern deep networks develop, with implications for developmental interpretability and mechanistic insight into transformer computation. Overall, the work offers suggestive empirical evidence that loss landscape degeneracy is linked to the emergence of higher-level computational structure in transformers.
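To make the phrase "SGLD-based LLC estimation" concrete, here is a minimal sketch on toy losses. It is not the authors' code. It assumes the commonly used estimator form n * beta * (E[L(w)] - L(w*)) with beta = 1 / log(n), where the expectation is taken over a tempered posterior localized around w* by a quadratic term of strength gamma. For simplicity it runs Langevin dynamics with the exact gradient of a known two-dimensional loss rather than minibatch gradients of a network, and the names `estimate_llc`, `gamma`, `step`, and the toy losses are all illustrative assumptions.

```python
import numpy as np

def estimate_llc(loss, grad, w_star, n=1000, gamma=1.0, step=1e-3,
                 num_steps=20000, burn_in=2000, num_chains=8, seed=0):
    """Langevin-based LLC estimate: n * beta * (E[loss(w)] - loss(w_star)).

    Samples w from p(w) proportional to
    exp(-n * beta * loss(w) - (gamma / 2) * ||w - w_star||^2), beta = 1 / log(n).
    `loss` maps an array of shape (chains, d) to shape (chains,);
    `grad` maps it to shape (chains, d).
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w_star = np.asarray(w_star, dtype=float)
    w = np.tile(w_star, (num_chains, 1))  # every chain starts at w_star
    sampled_losses = []
    for t in range(num_steps):
        # Gradient of the negative log density of the localized posterior.
        drift = n * beta * grad(w) + gamma * (w - w_star)
        noise = rng.standard_normal(w.shape)
        w = w - 0.5 * step * drift + np.sqrt(step) * noise
        if t >= burn_in:
            sampled_losses.append(loss(w))
    mean_loss = np.mean(sampled_losses)
    return n * beta * (mean_loss - loss(w_star[None, :])[0])

# Toy population losses on a two-dimensional parameter space (assumed examples).
def quad_loss(w):   # regular minimum
    return np.sum(w**2, axis=1)

def quad_grad(w):
    return 2.0 * w

def quart_loss(w):  # degenerate minimum
    return np.sum(w**4, axis=1)

def quart_grad(w):
    return 4.0 * w**3

llc_quad = estimate_llc(quad_loss, quad_grad, np.zeros(2))
llc_quart = estimate_llc(quart_loss, quart_grad, np.zeros(2))
print(f"quadratic: {llc_quad:.2f}, quartic: {llc_quart:.2f}")
```

The quartic minimum is flatter (more degenerate) than the quadratic one, so its estimate should come out noticeably lower. In a real training run the exact gradient would be replaced by minibatch gradients of the model's loss, and the step size, localization strength, and chain length would need tuning.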

Abstract

Deep learning involves navigating a high-dimensional loss landscape over the neural network parameter space. Over the course of training, complex computational structures form and re-form inside the neural network, leading to shifts in input/output behavior. It is a priority for the science of deep learning to uncover principles governing the development of neural network structure and behavior. Drawing on the framework of singular learning theory, we propose that model development is deeply linked to degeneracy in the local geometry of the loss landscape. We investigate this link by monitoring loss landscape degeneracy throughout training, as quantified by the local learning coefficient, for a transformer language model and an in-context linear regression transformer. We show that training can be divided into distinct periods of change in loss landscape degeneracy, and that these changes in degeneracy coincide with significant changes in the internal computational structure and the input/output behavior of the transformers. This finding provides suggestive evidence that degeneracy and development are linked in transformers, underscoring the potential of a degeneracy-based perspective for understanding modern deep learning.
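As a rough illustration of what "degeneracy in the local geometry" means, the sketch below follows the volume-scaling description given in the Figure 2 caption: it measures how fast the area of the sublevel set {w : L(w) < eps} shrinks as eps decreases, and reads off the exponent as the slope on a log-log plot. The three toy losses, the grid resolution, and the function name `volume_scaling_exponent` are assumptions chosen for illustration and are not taken from the paper. Each example has its minimum at the origin with loss 0 and local multiplicity 1.

```python
import numpy as np

def volume_scaling_exponent(loss, half_width=1.0, grid_points=2001,
                            thresholds=np.logspace(-3, -1, 9)):
    """Slope of log V(eps) against log eps, where V(eps) is the area of
    {w in [-half_width, half_width]^2 : loss(w) < eps}."""
    axis = np.linspace(-half_width, half_width, grid_points)
    w1, w2 = np.meshgrid(axis, axis)
    values = loss(w1, w2)                  # loss evaluated on the whole grid
    cell_area = (axis[1] - axis[0]) ** 2
    volumes = np.array([(values < eps).sum() * cell_area for eps in thresholds])
    slope, _ = np.polyfit(np.log(thresholds), np.log(volumes), 1)
    return slope

examples = {
    "w1^2 + w2^2": lambda w1, w2: w1**2 + w2**2,  # regular minimum
    "w1^4 + w2^2": lambda w1, w2: w1**4 + w2**2,  # degenerate along w1
    "w1^4 + w2^4": lambda w1, w2: w1**4 + w2**4,  # degenerate along both axes
}
exponents = {name: volume_scaling_exponent(f) for name, f in examples.items()}
for name, exponent in exponents.items():
    print(f"{name}: fitted exponent {exponent:.2f}")
```

For these sum-of-powers losses the exponent is known in closed form (1, 3/4, and 1/2 respectively), so the fitted slopes should land close to those values: the flatter the minimum, the more slowly the volume shrinks and the lower the LLC.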

Paper Structure (111 sections, 32 equations, 32 figures, 4 tables)

Introduction
Related work
Degeneracy and development in singular learning theory
Degeneracy and development in nonlinear dynamics
Stagewise development in deep learning
Studying loss landscape geometry
Training transformers in two settings
Language modeling
In-context linear regression
Quantifying degeneracy with the local learning coefficient
The local learning coefficient (LLC)
Estimating the LLC
Assumptions of LLC estimation
Degeneracy-based stage division
Bayesian local free energy
...and 96 more sections

Figures (32)

Figure 1: Tracking loss landscape degeneracy reveals developmental stages. We train transformer models on both (a) natural language data and (b) synthetic in-context linear regression data. In addition to test loss (top row), we track loss landscape degeneracy as quantified by the local learning coefficient (LLC) (middle row; \ref{section:llc}). Critical points in the LLC curve mark boundaries between distinct developmental stages (bottom row; warm hues for increasing LLC, cold for decreasing LLC; \ref{sec:slp}). We show in \ref{section:results_lm} and \ref{sec:results} that most of these stages coincide with the formation of significant internal structures or changes in input/output behavior. The language model first learns to predict using bigram statistics (LM1), then common $n$-grams (LM2), before forming the induction circuit studied by Olsson et al. (2022) (LM3 and LM4). The regression model first learns the optimal context-independent solution (LR1), then acquires robust in-context learning (LR2), then specializes to the pre-training distribution (LR3 and LR4). These stage divisions and interpretations are specific to the above training runs, but we show in \ref{appendix:lm-universality} that similar divisions arise with different training seeds.
Figure 2: The local learning coefficient (LLC) measures loss landscape degeneracy. The LLC can be defined in terms of the rate at which the parameter space volume (within a given neighborhood and with a given maximum loss) shrinks as the loss threshold is reduced to that of the local minimum. We show four population loss landscapes for a two-dimensional parameter space with decreasing LLC (increasing degeneracy). In these examples, the local multiplicity is 1. See \ref{appendix:app_geometry} for a detailed description of each example, as well as several additional examples.
Figure 3: In the singular learning process, the Bayesian posterior can shift between neighborhoods with different degeneracy. Watanabe's free energy formula (\ref{eq:free_energy_formula}) highlights a tradeoff between loss $\ell_n$ (the linear term coefficient) and degeneracy $\lambda$ (the LLC, the logarithmic term coefficient). Consider two local minima $w_1^\ast, w_2^\ast$ with neighborhoods $W_1^\ast, W_2^\ast$. As the number of samples $n$ increases, if $w_2^\ast$ has lower loss and higher LLC than $w_1^\ast$, $W_2^\ast$ will suddenly achieve lower free energy than $W_1^\ast$ at some critical sample size $n_{\text{crit}}$, causing the Bayesian posterior to shift from concentrating in $W_1^\ast$ to $W_2^\ast$.
Figure 4: Language model stages coincide with significant structural and behavioral changes. (a) The model learns bigram statistics in LM1, (b) then the positional embedding becomes useful from LM2, (c) enabling the learning of common $n$-grams. Induction circuit formation begins with (d) previous-token heads in LM3, followed by (e) induction heads in LM4, leading to (f) a drop in ICL score indicating the acquisition of in-context learning. Note: in (d,e), $l{:}h$ denotes attention head $h$ in layer $l$; dark lines indicate heads comprising the induction circuit.
Figure 5: In-context linear regression model stages coincide with significant structural and behavioral changes. (a) During LR1, the model learns to make context-independent predictions, $x_k \mapsto \hat{y}_k = 0$. (b) During LR2, ICL performance improves, then during LR3 the model becomes worse at ICL on OOD inputs $x_k \sim \mathcal{N}(0, gI_D)$ for $g > 3$. (c) During LR3 and LR4, layer normalization weights "collapse," possibly contributing to the LLC decrease.
...and 27 more figures
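
The tradeoff described in the Figure 3 caption can be worked through numerically. The sketch below keeps only the two leading terms named in the caption, a linear term n * loss and a logarithmic term llc * log(n), and scans for the first sample size at which the neighborhood with lower loss and higher LLC attains the lower free energy. The function names and the numerical values of the losses and LLCs are made up for illustration, and lower-order terms of the free energy formula are ignored.

```python
import math

def free_energy(n, loss, llc):
    """Leading-order local free energy: n * loss + llc * log(n)."""
    return n * loss + llc * math.log(n)

def critical_sample_size(loss_1, llc_1, loss_2, llc_2, n_max=10**6):
    """Smallest integer n >= 2 at which neighborhood 2 has strictly lower
    free energy than neighborhood 1, or None if that never happens up to n_max.

    The scan starts at n = 2 because the log term vanishes at n = 1, where
    this asymptotic approximation is not meaningful.
    """
    for n in range(2, n_max + 1):
        if free_energy(n, loss_2, llc_2) < free_energy(n, loss_1, llc_1):
            return n
    return None

# Hypothetical values: neighborhood 2 has lower loss but higher LLC.
loss_1, llc_1 = 0.5, 1.0
loss_2, llc_2 = 0.4, 5.0
n_crit = critical_sample_size(loss_1, llc_1, loss_2, llc_2)
print("critical sample size:", n_crit)
```

Below the critical sample size the logarithmic term dominates the comparison and the more degenerate neighborhood is preferred; above it, the linear term takes over and the lower-loss neighborhood wins. Equating the two free energies shows the crossover sits where n / log(n) equals the LLC gap divided by the loss gap.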
