The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training

Yongzhong Xu

Abstract

We develop the spectral edge thesis: phase transitions in neural network training -- grokking, capability gains, loss plateaus -- are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect ratio regime (parameters $p \sim 10^8$, window $W \sim 10$), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position $k^* = \operatorname{argmax}_j\, \sigma_j/\sigma_{j+1}$. From three axioms we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode's learning contribution to its Davis--Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that $k^*$ is the unique dynamically privileged position -- its collapse is the only one that disrupts learning, and it sustains itself through an $\alpha$-feedback loop requiring no assumption on the optimizer. The adiabatic parameter $\mathcal{A} = \|\Delta G\|_F / (\eta\, g^2)$ controls circuit stability: $\mathcal{A} \ll 1$ (plateau), $\mathcal{A} \sim 1$ (phase transition), $\mathcal{A} \gg 1$ (forgetting). Tested across six model families (150K--124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 0/24 without), the gap position is optimizer-dependent (Muon: $k^*=1$, AdamW: $k^*=2$ on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
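
The core diagnostic described in the abstract is straightforward to compute. The sketch below is our own illustration (not the paper's code): it forms the $W \times W$ Gram matrix from a window of parameter updates, locates the intra-signal gap position $k^*$, and evaluates the adiabatic parameter $\mathcal{A}$. The shapes, normalization, and toy data are assumptions.

```python
# A minimal sketch, assuming updates arrive as a (W, p) array of the last W
# parameter-update vectors. Not the paper's implementation.
import numpy as np

def gram_spectrum(updates):
    """Return the W x W Gram matrix and descending singular values sigma_j."""
    X = updates.T                                 # trajectory matrix, p x W
    G = X.T @ X                                   # Gram matrix, W x W
    evals = np.linalg.eigvalsh(G)[::-1]           # eigenvalues, descending
    sigma = np.sqrt(np.clip(evals, 0.0, None))    # singular values of X
    return G, sigma

def gap_position(sigma, eps=1e-12):
    """k* = argmax_j sigma_j / sigma_{j+1} (1-indexed, as in the paper)."""
    ratios = sigma[:-1] / (sigma[1:] + eps)
    return int(np.argmax(ratios)) + 1, ratios

def adiabatic_parameter(G_prev, G_curr, eta, grad_norm):
    """A = ||Delta G||_F / (eta * g^2): <<1 plateau, ~1 transition, >>1 forgetting."""
    return np.linalg.norm(G_curr - G_prev, "fro") / (eta * grad_norm**2)

# Toy usage with synthetic updates (W = 10 window, p = 1000 parameters).
rng = np.random.default_rng(0)
updates = rng.normal(scale=1e-3, size=(10, 1000))
updates[0] *= 50.0                                # plant one dominant mode
G, sigma = gram_spectrum(updates)
k_star, ratios = gap_position(sigma)
print("k* =", k_star, "gap ratio =", ratios[k_star - 1])
```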

Paper Structure

This paper contains 99 sections, 49 theorems, 131 equations, 5 figures, and 8 tables.

Key Result

Proposition 4.1

In the extreme aspect ratio regime ($p \gg W$, $p \sim 10^6$--$10^{10}$, $W \sim 10$--$30$), the BBP detection threshold for isotropic noise $\bm{\Sigma}_N = \nu^2 \bm{I}_p$ is
$$\lambda_{\mathrm{BBP}} = \nu^2 \sqrt{p/W}.$$
This threshold is trivially satisfied by every eigenvalue. Specifically, the per-coordinate noise standard deviation is $\nu = O(\eta / \sqrt{B \cdot p})$, where $B$ is the batch size, giving
$$\lambda_{\mathrm{BBP}} = O\!\left( \frac{\eta^2}{B \sqrt{pW}} \right).$$
For typical values of $\eta$, $B$, $p$, and $W$, this threshold lies many orders of magnitude below the smallest observed eigenvalue of $\bm{G}$, so BBP provides no detection power in this regime.
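
To see why the threshold is vacuous, one can plug in numbers. The values below ($\eta = 10^{-3}$, $B = 64$, $p = 10^8$, $W = 20$) are our own illustrative assumptions, not taken from the paper:

```python
# Illustrative check of Proposition 4.1 with assumed (not paper-supplied) values.
import math

eta, B, p, W = 1e-3, 64, 1e8, 20        # assumed typical training values
nu = eta / math.sqrt(B * p)             # per-coordinate noise std
bbp = nu**2 * math.sqrt(p / W)          # BBP detection threshold (eigenvalue scale)
update_energy = eta**2                   # crude scale of ||delta theta||^2 (assumed)
print(f"BBP threshold ~ {bbp:.3e}, typical eigenvalue scale ~ {update_energy:.3e}")
# -> threshold ~3.5e-13 vs. ~1e-6: every eigenvalue clears it by ~6 orders of magnitude.
```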

Figures (5)

  • Figure 1: The spectral edge framework. (A) Singular value spectrum of the Gram matrix $\bm{G}$ for TinyStories 51M, showing the intra-signal gap at $k^* = 2$. All eigenvalues are far above the BBP noise detection threshold (dashed; see Proposition 4.1). (B) Ratio profile $\sigma_k/\sigma_{k+1}$: the maximum at $k = 2$ defines the signal rank. (C) Gap ratio $\sigma_2/\sigma_3$ over training for TinyStories 51M, showing the three-phase pattern: rise, plateau, collapse. (D) Gap ratio $\sigma_3/\sigma_4$ tracks validation loss for GPT-2 124M. The spectral edge event ("Shift") coincides with the loss plateau.
  • Figure 2: Spectral edge analysis for TinyStories 51M. Top: Consecutive singular value ratios $\sigma_k/\sigma_{k+1}$ over training. The $\sigma_1/\sigma_2$ gap (red) dominates during the plateau phase, while $\sigma_2/\sigma_3$ (blue) shows the three-phase rise--plateau--collapse pattern. Middle: Eigenvalue gaps $\sigma_k^2 - \sigma_{k+1}^2$ over training, showing the same three-phase structure. Bottom: Gap ratio $\sigma_2/\sigma_3$ (blue) overlaid with validation loss (red), confirming that the spectral edge tracks learning progress.
  • Figure 3: Stability coefficient hierarchy for GPT-2 124M. Heatmap of $\alpha_j$ (Definition \ref{def:stability-coeff}) across eigenvalue index $j$ (vertical) and training step (horizontal). The dominant mode ($j = 1$, green) has $\alpha \approx 1$ throughout training. The gap mode ($j = 2$) fluctuates between stable (green) and unstable (red) as $k^*$ shifts. All subdominant modes ($j \geq 3$, red) have $\alpha \approx 0$. The hierarchy $\alpha_1 > \alpha_2 > \cdots$ is visually obvious.
  • Figure 4: Grokking as a spectral edge event (Dyck-1). Grokking runs (weight decay $\lambda = 1.0$, red/solid, 3 seeds) vs. control runs ($\lambda = 0$, blue/dashed, 3 seeds). Top: singular value ratio $\sigma_1/\sigma_2$ of $W_Q$. With weight decay, the ratio rises dramatically as $\sigma_2 \to 0$ (gap opens); without weight decay, the ratio stays flat (no gap). Middle: Frobenius norm $\|W_Q\|_F$. Weight decay compresses the weights; the control runs retain large norms. Bottom: test accuracy. Grokking (accuracy $\to 1$) occurs only in the weight-decay runs, coinciding with the gap opening above. All 6 grokking runs show gap opening; all 6 controls show neither gap opening nor grokking.
  • Figure 5: Dependency graph of the spectral edge framework. Green = no dependence on $[\mathcal{P}, H] \approx 0$. Yellow (dashed) = qualitative conclusion survives; only exact coefficients need commutativity. Orange (thick dashed) = both structure and conclusion require $[\mathcal{P}, H] \approx 0$. The core diagnostic and learning-theoretic results (left column and NTK branch) are entirely clean. The flow equations are weakly dependent. Only the Krylov bound, collapse/opening times, and $\beta_2$ theorem have strong dependence.
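
Figures 1, 2, and 4 all reduce to one time series: a consecutive ratio $\sigma_k/\sigma_{k+1}$ tracked over checkpoints, with a spectral edge event flagged when a previously opened gap collapses. Below is a hedged sketch of that diagnostic; the detection thresholds (`open_thresh`, `collapse_frac`) are invented for illustration, and the paper's actual criterion may differ.

```python
# A sketch of the gap-tracking diagnostic, assuming a per-checkpoint history
# of descending singular values. Thresholds are illustrative assumptions.
import numpy as np

def gap_ratio_series(sigma_history, k):
    """sigma_history: (T, W) descending singular values per checkpoint; k is 1-indexed."""
    s = np.asarray(sigma_history)
    return s[:, k - 1] / (s[:, k] + 1e-12)

def detect_collapse(ratios, open_thresh=3.0, collapse_frac=0.5):
    """Return the first step where a gap that had opened (> open_thresh) falls
    below collapse_frac of its running peak -- the rise/plateau/collapse pattern."""
    peak = -np.inf
    for t, r in enumerate(ratios):
        peak = max(peak, r)
        if peak > open_thresh and r < collapse_frac * peak:
            return t
    return None

# Toy usage: a gap ratio that rises to ~8, plateaus, then falls back toward 1.
t = np.arange(200)
ratios = 1 + 7 * np.exp(-((t - 80) / 50.0) ** 2)
print("collapse step:", detect_collapse(ratios))
```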

Theorems & Definitions (144)

  • Definition 1.1: Training Trajectory
  • Definition 1.2: Rolling Window and Trajectory Matrix
  • Definition 1.3: Gram Matrix and Spectrum
  • Remark 1.4: Why the Gram Matrix
  • Definition 1.5: Sliding-Window Covariance
  • Definition 1.6: Aspect Ratio
  • Remark 1.7: Classical vs. Extreme RMT
  • Remark 2.4: Justification of the Axioms
  • Remark 3.1: Architecture Independence: A Geometric Flow Theory
  • Proposition 4.1: BBP is Vacuous
  • ...and 134 more