Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Yongzhong Xu

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Yongzhong Xu

Abstract

Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce \emph{Spectral Edge Dynamics} (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary -- the \emph{spectral edge} -- between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $σ_k/σ_{k+1}$. Across a 51M-parameter TinyStories model (4~seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within 5.7\%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization 600--1{,}700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Abstract

. Across a 51M-parameter TinyStories model (4~seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity (

at 51M,

at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to

dimensions (e.g.,

for

) preserves the spectral gap within 5.7\%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization 600--1{,}700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.

Paper Structure (47 sections, 4 equations, 4 figures)

This paper contains 47 sections, 4 equations, 4 figures.

Introduction
Framework
Trajectory Matrix and Gram Matrix Trick
Precomputed dot-product matrix.
Signal Rank and Spectral Gap
Cross-Correlation Analysis
Geometric Properties
Experiment 1: TinyStories (51M Parameters)
Setup
Signal Detection and Noise Structure (Property \ref{['pred:bbp']})
Three-Phase Spectral Pattern (Property \ref{['pred:universal']})
Phase-Specific Correlations
Lag Flip with Window Size (Property \ref{['pred:lagflip']})
Comparison with Drift Speed
Johnson--Lindenstrauss Projection (Property \ref{['pred:jl']})
...and 32 more sections

Figures (4)

Figure 1: Overview of Spectral Edge Dynamics.(A) Singular value spectrum of the trajectory matrix $\Delta$ (TinyStories 51M, step 2000). The first $k^* = 2$ modes (blue) carry the training signal; remaining modes form the noise bulk (grey). The Marchenko--Pastur edge (dashed) separates signal from noise. (B) Consecutive ratio profile $\sigma_k/\sigma_{k+1}$: the signal rank $k^*$ is the index of the maximum ratio above the MP edge. (C) Tracking $\sigma_2/\sigma_3$ across training reveals a universal Rise--Plateau--Collapse pattern (TinyStories, seed 42). (D) At GPT-2 124M scale ($k^* = 3$), the spectral edge ratio $\sigma_3/\sigma_4$ co-moves with validation loss, declining sharply at the distribution shift (dashed line).
Figure 2: Three-phase spectral edge pattern across TinyStories seeds. The spectral edge ratio $\sigma_2/\sigma_3$ rises during rapid learning, plateaus near its peak, and collapses as the second signal direction merges with the noise bulk. Collapse timing varies by $\leq 4\%$ of total training across seeds.
Figure 3: GPT-2 124M: Spectral edge ratio and validation loss. The spectral edge ratio $\sigma_3/\sigma_4$ (blue, left axis) is tightly coupled to val-loss (red, right axis). The distribution shift at step 17,800 produces a sharp spectral reorganization: the edge ratio drops abruptly, recovers during rapid improvement, then flattens as overfitting begins around step 22,200.
Figure 4: Phase-dependent spectral edge--loss coupling. Sliding-window Pearson correlation between $\sigma_3/\sigma_4$ and val-loss (window of 7 steps, $W = 10$). The correlation sign flips across training phases: strongly negative during late pretraining ($r \approx -0.95$), strongly positive after the distribution shift ($r \approx +0.83$), and negative again during overfitting ($r \approx -0.75$).

Theorems & Definitions (5)

Definition 1: Trajectory Matrix
Definition 2: Gram Matrix
Definition 3: Spectral Edge Ratio
Definition 4: Signal Rank
Definition 5: Spectral Gap

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Abstract

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

Authors

Abstract

Table of Contents

Figures (4)

Theorems & Definitions (5)