Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy

Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Li Shang

TL;DR

The paper identifies a persistent spike-tail structure in LLM gradient spectra, where a small low-rank subspace dominates optimization and suppresses learning in the long tail. It introduces Spectra, a spike-aware optimizer that attenuates the dominant spike subspace using cached, warm-started power iteration and low-rank spectral shaping, avoiding amplification of noise-dominated tail directions. Empirically, Spectra accelerates convergence and reduces optimizer memory while improving downstream accuracy on large models (e.g., 8B parameters) compared to AdamW and Muon, with favorable latency and memory characteristics. This work demonstrates that viewing gradients as structured spectral objects enables targeted, efficient optimization for large-scale language models, yielding practical gains in speed, stability, and performance.

Abstract

Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while context-specific information resides in a long tail. We show that this spike-tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second-moment normalization and tightening the globally stable learning-rate bound. Motivated by this analysis, we propose Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping with negligible overhead and substantially reduced optimizer-state memory. On LLaMA3 8B trained on 50B tokens, Spectra reaches the same target loss 30% faster than AdamW, reduces per-step end-to-end overhead by 0.7%, cuts optimizer-state memory by 49.25%, and improves average downstream accuracy by 1.62%. Compared to Muon, Spectra is 5.1x faster in optimizer processing time, achieves a lower final loss, and improves average accuracy by 0.66%.
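
To make the idea of cached, warm-started power iteration concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's algorithm: the function names `track_spike_subspace` and `attenuate_spike`, the fixed attenuation factor `alpha`, and the synthetic "planted spike plus noise" gradient are all hypothetical. In particular, scaling the spike component by a constant is only a stand-in for whatever low-rank spectral shaping Spectra actually applies.

```python
import numpy as np

def track_spike_subspace(G, Q_prev, n_iter=1):
    """Refine an orthonormal basis for the top-k right singular subspace of G.

    G: (m, n) gradient matrix. Q_prev: (n, k) basis cached from the last step.
    Warm-starting from Q_prev means very few iterations are needed when the
    subspace drifts slowly between steps (an assumption of this sketch).
    """
    Q = Q_prev
    for _ in range(n_iter):
        Z = G.T @ (G @ Q)          # one block power step on G^T G
        Q, _ = np.linalg.qr(Z)     # re-orthonormalize the k columns
    return Q

def attenuate_spike(G, Q, alpha=0.1):
    """Scale the part of G lying in span(Q) by alpha; leave the rest untouched."""
    spike_part = (G @ Q) @ Q.T     # rank-k projection onto the tracked subspace
    return G - (1.0 - alpha) * spike_part

# Synthetic data (assumed): a rank-2 spike plus small dense noise as the "tail".
rng = np.random.default_rng(0)
m, n, k = 64, 48, 2
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
spike = U @ np.diag([100.0, 80.0]) @ V.T

Q, _ = np.linalg.qr(rng.standard_normal((n, k)))    # cold start, first step only
for step in range(5):
    G = spike + 0.1 * rng.standard_normal((m, n))   # fresh noise every step
    Q = track_spike_subspace(G, Q, n_iter=1)        # reuse the cached basis

G_shaped = attenuate_spike(G, Q, alpha=0.1)
top_before = np.linalg.norm(G, 2)         # largest singular value before shaping
top_after = np.linalg.norm(G_shaped, 2)   # largest singular value after shaping
```

In this sketch each step costs two thin matrix products and a QR factorization of an n-by-k matrix, which is why a tracked low-rank basis can be much cheaper than a full SVD. Only the k basis vectors need to be cached between steps.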

Paper Structure (26 sections, 1 theorem, 11 equations, 13 figures, 7 tables, 2 algorithms)

This paper contains 26 sections, 1 theorem, 11 equations, 13 figures, 7 tables, 2 algorithms.

Key Result

Theorem 2.1

Assume $\mathbf{H}\succeq \mathbf{0}$ and consider the update $\mathbf{w}^{+}=\mathbf{w}-\eta\,\mathbf{g}$. Under the second-order approximation of $\mathbb{E}[L(\mathbf{w}^{+})]$, the mean-optimal learning rate is [...]. Moreover, it is upper bounded by the spike variance contribution: [...]. If $\mathbf{H}\succeq \mu \mathbf{I}$ for some $\mu>0$, then [...].
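
The kind of bound Theorem 2.1 describes can be motivated by a standard calculation. The derivation below is a generic sketch of how such a bound can arise, not the paper's own statement or proof. It assumes (these symbols are not from the source) that the stochastic gradient $\mathbf{g}$ is unbiased, with mean $\bar{\mathbf{g}}=\nabla L(\mathbf{w})$ and covariance $\boldsymbol{\Sigma}$, and that $\operatorname{tr}(\mathbf{H}\boldsymbol{\Sigma})>0$.

```latex
\begin{align}
\mathbb{E}\bigl[L(\mathbf{w}-\eta\,\mathbf{g})\bigr]
  &\approx L(\mathbf{w})
     - \eta\,\bar{\mathbf{g}}^{\top}\mathbb{E}[\mathbf{g}]
     + \tfrac{\eta^{2}}{2}\,\mathbb{E}\bigl[\mathbf{g}^{\top}\mathbf{H}\,\mathbf{g}\bigr] \\
  &= L(\mathbf{w})
     - \eta\,\|\bar{\mathbf{g}}\|^{2}
     + \tfrac{\eta^{2}}{2}\Bigl(\bar{\mathbf{g}}^{\top}\mathbf{H}\,\bar{\mathbf{g}}
       + \operatorname{tr}(\mathbf{H}\boldsymbol{\Sigma})\Bigr), \\
\eta^{\star}
  &= \frac{\|\bar{\mathbf{g}}\|^{2}}
          {\bar{\mathbf{g}}^{\top}\mathbf{H}\,\bar{\mathbf{g}}
           + \operatorname{tr}(\mathbf{H}\boldsymbol{\Sigma})}
  \;\le\; \frac{\|\bar{\mathbf{g}}\|^{2}}{\operatorname{tr}(\mathbf{H}\boldsymbol{\Sigma})}
  \;\le\; \frac{\|\bar{\mathbf{g}}\|^{2}}{\mu\,\operatorname{tr}(\boldsymbol{\Sigma})}.
\end{align}
```

The first inequality uses $\mathbf{H}\succeq\mathbf{0}$, so that $\bar{\mathbf{g}}^{\top}\mathbf{H}\,\bar{\mathbf{g}}\ge 0$; the second additionally requires $\mathbf{H}\succeq\mu\mathbf{I}$ with $\mu>0$, which gives $\operatorname{tr}(\mathbf{H}\boldsymbol{\Sigma})\ge\mu\operatorname{tr}(\boldsymbol{\Sigma})$. Because the trace sums over all directions, a few directions with very large gradient variance inflate the denominator and shrink the single step size shared by every direction, which matches the intuition behind a spike-dominated learning-rate bound.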

Figures (13)

  • Figure 1: Singular-value spectra of the deepest-layer MLP gradient in Qwen3 models (0.6B--32B) at multiple training stages exhibit a consistent "low-rank spike + smooth tail" profile, with spike singular values separated from the tail by $\sim$1--2 orders of magnitude and occupying a nearly constant $\approx 1.5\%$ of directions.
  • Figure 2: Gradient spectrum under two controlled interventions on Qwen3-0.6B: frequency-normalized loss (FreqNorm, top) selectively suppresses the leading spike components, while intra-sentence token permutation (Shuffle, bottom) selectively amplifies them; in both cases, changes rapidly vanish in the tail.
  • Figure 3: Spike-dominated second-moment accumulation suppresses tail updates (Qwen3-0.6B). Top: cumulative spectral energy (CDF) of AdamW moments, showing that the second moment $\mathbf{V}$ is far more spike-concentrated than the first moment $\mathbf{M}$. Bottom: element-wise magnitudes of tail updates, where full normalization $\mathbf{M}_t/(\sqrt{\mathbf{V}_s+\mathbf{V}_t}+\epsilon)$ is strongly contracted relative to the tail-only baseline $\mathbf{M}_t/(\sqrt{\mathbf{V}_t}+\epsilon)$.
  • Figure 4: $\mathrm{RelVar}(k)=\mathrm{Var}(a_k)/\sigma_k^2$ increases with $k$, indicating more noise-dominated small-singular directions.
  • Figure 5: Alignment between singular directions of $G$ and $\mathrm{NS}(G)$. NS largely preserves head directions but severely disrupts tail directions.
  • ...and 8 more figures
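
The contraction described in the Figure 3 caption follows from the normalization itself: adding a non-negative spike term $\mathbf{V}_s$ under the square root can only shrink each element of the update. A minimal NumPy sketch with synthetic values (the array size and the magnitudes of `M_t`, `V_t`, and `V_s` are assumptions chosen for illustration, not measurements from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-8

M_t = rng.standard_normal(1000) * 1e-3   # tail first-moment entries (synthetic)
V_t = np.full(1000, 1e-6)                # tail second-moment entries (synthetic)
V_s = np.full(1000, 1e-4)                # spike second moment, assumed 100x larger

full = M_t / (np.sqrt(V_s + V_t) + eps)  # normalization with spike + tail energy
tail_only = M_t / (np.sqrt(V_t) + eps)   # tail-only baseline

# Ratio of mean update magnitudes; about 0.1 with these synthetic values,
# i.e. tail updates shrink by roughly sqrt(V_t / (V_s + V_t)).
contraction = np.abs(full).mean() / np.abs(tail_only).mean()
```

With a spike second moment 100 times larger than the tail's, the tail update is contracted by about a factor of ten in this toy setting; the actual contraction in a trained model depends on the measured moments.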

Theorems & Definitions (2)

  • Theorem 2.1: Spike-dominated variance bounds the mean-optimal learning rate
  • Proof: Proof of Theorem 2.1