Enhancing LLM Training via Spectral Clipping

Xiaowen Jiang, Andrei Semenov, Sebastian U. Stich

Abstract

While spectral-based optimizers like Muon operate directly on the spectrum of updates, standard adaptive methods such as AdamW do not account for the global spectral structure of weights and gradients, leaving them vulnerable to two empirical issues in large language model (LLM) training: (i) the optimizer updates can have large spectral norms, potentially destabilizing training and degrading generalization; (ii) stochastic gradient noise can exhibit sparse spectral spikes, with a few dominant singular values much larger than the rest. We propose SPECTRA, a general framework addressing these by (i) post-spectral clipping of updates to enforce spectral-norm constraints; (ii) optional pre-spectral clipping of gradients to suppress spectral noise spikes. We prove that post-clipping constitutes a Composite Frank-Wolfe method with spectral-norm constraints and weight regularization, recovering Frobenius and $\ell_{\infty}$-norm regularization with SGD-based and sign-based methods. We further analyze how pre-clipping mitigates sparse spectral spikes. We propose efficient soft spectral clipping via Newton-Schulz iterations, avoiding expensive SVD. Experiments on LLM pretraining show SPECTRA uniformly improves validation loss for various optimizers, including AdamW, Signum, and AdEMAMix, with the best-performing variants achieving state-of-the-art results. Models trained with SPECTRA exhibit smaller weight norms, confirming the link between spectral clipping and regularization.
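To make the idea of spectral clipping concrete, the following is a minimal reference sketch, not the authors' implementation. It hard-clips singular values with an explicit SVD, whereas the abstract describes a soft clipping computed with Newton-Schulz iterations precisely to avoid the SVD. The function name `spectral_clip_svd` and the threshold `c` are illustrative assumptions.

```python
import numpy as np

def spectral_clip_svd(M, c):
    """Hard-clip the singular values of M at threshold c.

    Reference version based on a full SVD; this is an illustrative stand-in,
    not the soft Newton-Schulz clipping described in the abstract.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scale each left singular vector by its clipped singular value.
    return (U * np.minimum(s, c)) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))          # stand-in for an update or gradient matrix
G_clipped = spectral_clip_svd(G, c=1.0)

# After clipping, the spectral norm (largest singular value) is at most c.
spec_norm = np.linalg.norm(G_clipped, 2)

# A matrix whose singular values are already below c is left unchanged.
small = 0.01 * G
small_clipped = spectral_clip_svd(small, c=1.0)
```

Clipping only touches singular values above `c`, so the singular vectors, and hence the update's directions, are preserved while its spectral norm is bounded.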

Paper Structure

This paper contains 34 sections, 23 theorems, 123 equations, 24 figures, 13 tables, 3 algorithms.

Key Result

Proposition 1

The update rule \eqref{eq:SGDM-spectral-clipping} can be equivalently reformulated as a stochastic composite Frank-Wolfe method by choosing $\gamma_k = \lambda \eta_k$, $c_k \equiv \frac{\lambda D_2}{\alpha}$ and $\psi(\mathbf{X}) = \frac{\lambda}{2 \alpha} \|\mathbf{X}\|_F^2$, provided that $Q = Q_2$ and each subproblem for computing $\mathbf{V}_{k+1}$ is solved exactly.
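The link between the clipping threshold $c_k = \frac{\lambda D_2}{\alpha}$ and the regularizer $\psi$ can be checked numerically. The sketch below assumes the subproblem has the form $\min_{\|\mathbf{V}\|_2 \le D_2} \langle \mathbf{G}, \mathbf{V} \rangle + \psi(\mathbf{V})$, with $\mathbf{G}$ a generic gradient or momentum matrix. The exact subproblem is not shown in this excerpt, so this form, the helper names, and the numeric values are assumptions. Under that assumption, the minimizer is the projection of $-\frac{\alpha}{\lambda}\mathbf{G}$ onto the spectral-norm ball of radius $D_2$, which equals $-\frac{\alpha}{\lambda}$ times $\mathbf{G}$ with its singular values clipped at $c_k$.

```python
import numpy as np

def spectral_clip(M, c):
    """Clip the singular values of M at c (SVD reference version)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.minimum(s, c)) @ Vt

def objective(V, G, lam, alpha):
    # ASSUMED subproblem objective: <G, V> + psi(V),
    # with psi(V) = lam / (2 * alpha) * ||V||_F^2 as in Proposition 1.
    return np.sum(G * V) + lam / (2.0 * alpha) * np.sum(V * V)

def random_feasible(rng, shape, radius):
    # Random matrix rescaled so its spectral norm is at most `radius`.
    R = rng.standard_normal(shape)
    return R * (radius * rng.uniform() / np.linalg.norm(R, 2))

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))      # stand-in gradient/momentum matrix
lam, alpha, D2 = 1.0, 1.0, 2.0       # illustrative values, not from the paper

# Candidate minimizer: clip G at c = lam * D2 / alpha, then rescale and negate.
c = lam * D2 / alpha
V_star = -(alpha / lam) * spectral_clip(G, c)

# Same point written as a projection onto the spectral-norm ball of radius D2.
V_proj = spectral_clip(-(alpha / lam) * G, D2)

# No random feasible point should achieve a lower objective value.
f_star = objective(V_star, G, lam, alpha)
gaps = [objective(random_feasible(rng, G.shape, D2), G, lam, alpha) - f_star
        for _ in range(200)]
```

The check relies on the fact that the assumed objective equals $\frac{\lambda}{2\alpha}\|\mathbf{V} + \frac{\alpha}{\lambda}\mathbf{G}\|_F^2$ up to a constant, so minimizing it over the spectral-norm ball is a Euclidean projection, and that projection clips singular values.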

Figures (24)

  • Figure 1: Final validation loss comparison for large Llama-style models trained for the Chinchilla optimal horizon. 'Thin and deep' models have smaller embedding dimensions but more layers than 'wide and shallow' ones. The corresponding validation perplexities and run-time comparisons with and without SPECTRA are provided in Figures \ref{fig:780M-16k-perplexity} and \ref{fig:run_time_iter_820M}. We use total batch sizes of 1012 and 992 for the 780M and 820M models, respectively, with a sequence length of 1024. The hyperparameters, such as the learning rate, used for each method are reported in the tables in Section \ref{sec:lLM-pretrain-appendix}.
  • Figure 2: Validation loss comparison under small-batch-size training. Spectra-Signum remains stable at the large learning rate, whereas Signum exhibits training instability (Signum typically prefers a smaller learning rate than Adam-like methods \cite{semenov2025benchmarking}). The curve for Spectra-Signum with both pre- and post-clipping almost overlaps with that of Spectra-Signum with post-clipping only.
  • Figure 3: Final validation loss comparison for a small Llama model trained with both cosine and WSD learning-rate schedules. The run-time comparisons with and without SPECTRA are provided in Figure \ref{fig:run_time_iter_160M_wsd}. We use a total batch size of 256 with a sequence length of 512. The hyperparameters, such as the learning rate, used for each method are reported in the tables in Section \ref{sec:lLM-pretrain-appendix}.
  • Figure 4: Evolution of the aggregated $\ell_{\infty}$ and Frobenius norms of all weight matrices during training. SPECTRA consistently maintains a lower $\ell_{\infty}$ norm compared to the base optimizer.
  • Figure 5: Comparison of different methods with and without SPECTRA during the warm-up phase. SPECTRA improves convergence across all three methods. (AdamW exhibits training instability during the warm-up phase because of the large step size required to achieve a low final validation loss.)
  • ...and 19 more figures

Theorems & Definitions (46)

  • Proposition 1
  • proof
  • Theorem 1
  • Remark 2
  • Definition 1
  • Lemma 3
  • Lemma 4
  • Theorem 5
  • Theorem 6
  • Definition 2
  • ...and 36 more