Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Yongzhong Xu

Abstract

Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary -- the spectral edge -- between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a lag flip reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization 600--1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
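
The spectral-edge measurement described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `spectral_edge` and `jl_project`, the (W, P) row layout of the window, the Gaussian form of the random projection, and the synthetic two-direction data are all assumptions made for illustration. The sketch also omits the Marchenko--Pastur threshold mentioned in Figure 1 and simply takes the largest consecutive ratio.

```python
import numpy as np


def spectral_edge(deltas, eps=1e-12):
    """Return (k_star, edge_ratio, singular_values) for one window.

    deltas: array of shape (W, P); row t holds one flattened parameter
    update (this layout is an assumption of the sketch).
    """
    s = np.linalg.svd(deltas, compute_uv=False)  # sorted in descending order
    ratios = s[:-1] / np.maximum(s[1:], eps)     # sigma_k / sigma_{k+1}
    k_star = int(np.argmax(ratios)) + 1          # 1-based index of the largest ratio
    return k_star, float(ratios[k_star - 1]), s


def jl_project(deltas, d, seed=1):
    """Gaussian random projection of each row from P down to d dimensions."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((deltas.shape[1], d)) / np.sqrt(d)
    return deltas @ proj


# Synthetic window (hypothetical data): two coherent directions plus noise.
rng = np.random.default_rng(0)
W, P = 10, 2000
time_modes, _ = np.linalg.qr(rng.standard_normal((W, 2)))
param_modes, _ = np.linalg.qr(rng.standard_normal((P, 2)))
signal = time_modes @ np.diag([100.0, 50.0]) @ param_modes.T
window = signal + 0.1 * rng.standard_normal((W, P))

k_full, ratio_full, _ = spectral_edge(window)
k_proj, ratio_proj, _ = spectral_edge(jl_project(window, d=10 * W))
print("signal rank (full, projected):", k_full, k_proj)
print("edge ratio  (full, projected):", ratio_full, ratio_proj)
```

In a rolling-window setting, `spectral_edge` would be called once per training step on the most recent W updates, and the resulting edge ratio tracked over time. The projection here uses $d = 10W$ as in the abstract; how closely the projected ratio matches the full one on real trajectories is the paper's empirical claim and is not reproduced by this toy data.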

Paper Structure (47 sections, 4 equations, 4 figures)

Figures (4)

  • Figure 1: Overview of Spectral Edge Dynamics. (A) Singular value spectrum of the trajectory matrix $\Delta$ (TinyStories 51M, step 2000). The first $k^* = 2$ modes (blue) carry the training signal; remaining modes form the noise bulk (grey). The Marchenko--Pastur edge (dashed) separates signal from noise. (B) Consecutive ratio profile $\sigma_k/\sigma_{k+1}$: the signal rank $k^*$ is the index of the maximum ratio above the MP edge. (C) Tracking $\sigma_2/\sigma_3$ across training reveals a universal Rise--Plateau--Collapse pattern (TinyStories, seed 42). (D) At GPT-2 124M scale ($k^* = 3$), the spectral edge ratio $\sigma_3/\sigma_4$ co-moves with validation loss, declining sharply at the distribution shift (dashed line).
  • Figure 2: Three-phase spectral edge pattern across TinyStories seeds. The spectral edge ratio $\sigma_2/\sigma_3$ rises during rapid learning, plateaus near its peak, and collapses as the second signal direction merges with the noise bulk. Collapse timing varies by $\leq 4\%$ of total training across seeds.
  • Figure 3: GPT-2 124M: Spectral edge ratio and validation loss. The spectral edge ratio $\sigma_3/\sigma_4$ (blue, left axis) is tightly coupled to val-loss (red, right axis). The distribution shift at step 17,800 produces a sharp spectral reorganization: the edge ratio drops abruptly, recovers during rapid improvement, then flattens as overfitting begins around step 22,200.
  • Figure 4: Phase-dependent spectral edge--loss coupling. Sliding-window Pearson correlation between $\sigma_3/\sigma_4$ and val-loss (window of 7 steps, $W = 10$). The correlation sign flips across training phases: strongly negative during late pretraining ($r \approx -0.95$), strongly positive after the distribution shift ($r \approx +0.83$), and negative again during overfitting ($r \approx -0.75$).
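
The sliding-window Pearson correlation used in Figure 4 can be sketched as follows. This is an illustrative sketch only: the helper name `sliding_pearson` and the two synthetic series are hypothetical, chosen so that the correlation sign flips halfway through, and they do not reproduce the paper's data.

```python
import numpy as np


def sliding_pearson(x, y, window):
    """Pearson correlation of x and y over each length-`window` slice."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    if not 2 <= window <= x.size:
        raise ValueError("window must be between 2 and len(x)")
    n = x.size - window + 1
    out = np.empty(n)
    for i in range(n):
        # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
        out[i] = np.corrcoef(x[i:i + window], y[i:i + window])[0, 1]
    return out


# Hypothetical series: the edge ratio rises then falls, while loss keeps falling.
edge_ratio = np.concatenate([np.linspace(1.0, 2.0, 10), np.linspace(2.0, 1.0, 10)])
val_loss = np.linspace(3.0, 1.0, 20)

r = sliding_pearson(edge_ratio, val_loss, window=7)
print(np.round(r, 2))
```

With a window of 7 points, as in the caption, the early windows give a strongly negative correlation and the late windows a strongly positive one, which is the kind of sign flip the figure reports across training phases.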

Theorems & Definitions (5)

  • Definition 1: Trajectory Matrix
  • Definition 2: Gram Matrix
  • Definition 3: Spectral Edge Ratio
  • Definition 4: Signal Rank
  • Definition 5: Spectral Gap
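
Only the definition titles are listed here, so the following is a sketch under a stated assumption rather than the paper's formulation: it assumes the Gram matrix of Definition 2 is the $W \times W$ matrix $G = \Delta\Delta^\top$ built from the trajectory matrix $\Delta$. Under that assumption, the singular values of $\Delta$ are the square roots of the eigenvalues of $G$, so the spectral edge ratio can be computed without a decomposition in the full parameter dimension. The random `delta` below is placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
W, P = 10, 500                       # window length, parameter count (toy sizes)
delta = rng.standard_normal((W, P))  # placeholder trajectory matrix

# Assumed Gram matrix: W x W, cheap to decompose when W << P.
gram = delta @ delta.T
eigvals = np.linalg.eigvalsh(gram)[::-1]                 # reorder to descending
sigma_from_gram = np.sqrt(np.clip(eigvals, 0.0, None))   # guard tiny negatives

# Direct SVD of the full matrix, for comparison.
sigma_direct = np.linalg.svd(delta, compute_uv=False)

print(np.allclose(sigma_from_gram, sigma_direct))
```

The consecutive ratios $\sigma_k/\sigma_{k+1}$ can then be read off `sigma_from_gram` directly.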