When do spectral gradient updates help in deep learning?

Damek Davis, Dmitriy Drusvyatskiy

TL;DR

The paper explains when spectral gradient updates (such as SpecGD and Muon) outperform Euclidean gradient descent in training deep networks and transformers. It introduces a layerwise condition, nr(G) ≥ st(A), which compares the nuclear rank of the gradient G (its squared nuclear-to-Frobenius ratio) with the stable rank of the incoming activations A, linking gradient structure to activation degeneracy. It shows that post-activation matrices generically have low stable rank under Gaussian initialization, and that in spiked random-feature models the gradient’s nuclear rank grows with dimension after a short burn-in, creating a regime where spectral updates yield larger one-step loss decreases. The authors extend the analysis to a general layered model and to transformer blocks, providing a Hessian-based, layerwise bound that translates into a practical descent comparison. They validate the theory with synthetic experiments and NanoGPT-scale training, where internal activations exhibit low stable rank and gradients maintain large nuclear rank. The results give a concrete, data-driven account of the regimes where spectral gradient methods are advantageous, particularly in internal transformer and MLP blocks, while noting exceptions (e.g., gated activations) where the benefit may diminish.
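
The two quantities in the condition nr(G) ≥ st(A) can be computed directly from singular values. The snippet below is a minimal NumPy sketch, assuming the usual definitions st(A) = ‖A‖_F² / ‖A‖_op² and nr(G) = ‖G‖_*² / ‖G‖_F²; the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def stable_rank(A):
    """st(A) = ||A||_F^2 / ||A||_op^2 (assumed definition)."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return float(np.sum(s ** 2) / s[0] ** 2)

def nuclear_rank(G):
    """nr(G) = ||G||_*^2 / ||G||_F^2, the squared nuclear-to-Frobenius ratio."""
    s = np.linalg.svd(G, compute_uv=False)
    return float(np.sum(s) ** 2 / np.sum(s ** 2))

def spectral_step_predicted_better(G, A):
    """Layerwise check nr(G) >= st(A) for one parameter block (hypothetical helper)."""
    return bool(nuclear_rank(G) >= stable_rank(A))

# Toy usage: a gradient with a flat spectrum and rank-one activations.
G_toy = np.eye(4)                                   # all singular values equal
A_toy = np.outer([1.0, 2.0, 3.0, 4.0], np.ones(6))  # rank-one matrix
print(nuclear_rank(G_toy), stable_rank(A_toy))
print(spectral_step_predicted_better(G_toy, A_toy))
```

Both ratios lie between 1 and the rank of the matrix, so the condition favors spectral updates when the gradient spreads its energy over many singular directions while the activations concentrate theirs in a few.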

Abstract

Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
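
The claim that post-activation matrices have low stable rank at Gaussian initialization can be probed numerically. The sketch below is illustrative only: the sizes, the 1/sqrt(d) weight scaling, and the ReLU activation are assumptions made here, not the paper's experimental setup. It compares the stable rank of a Gaussian pre-activation matrix with that of its ReLU post-activation for several input dimensions.

```python
import numpy as np

def stable_rank(A):
    """st(A) = ||A||_F^2 / ||A||_op^2."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

rng = np.random.default_rng(0)
n, k = 256, 256          # number of samples and hidden width (assumed sizes)
results = {}
for d in (32, 128, 512):  # input dimensions to sweep (assumed)
    X = rng.standard_normal((d, n))               # Gaussian data, one column per sample
    W = rng.standard_normal((k, d)) / np.sqrt(d)  # Gaussian weights, assumed scaling
    pre = W @ X                                   # pre-activations
    post = np.maximum(pre, 0.0)                   # ReLU post-activations
    results[d] = (stable_rank(pre), stable_rank(post))

for d, (st_pre, st_post) in results.items():
    print(f"d={d:4d}  st(pre)={st_pre:7.2f}  st(post)={st_post:5.2f}")
```

In this toy setting the post-activation stable rank is expected to stay small relative to min(k, n), because ReLU outputs have a nonzero mean that produces one dominant singular direction, while the pre-activation stable rank is noticeably larger.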

Paper Structure

This paper contains 56 sections, 34 theorems, 501 equations, 19 figures, 2 tables.

Key Result

Lemma 2.1

The following holds:

Figures (19)

  • Figure 1: We generate $n=512$ random Gaussian vectors of dimension $d=128$ and use a target $y$ equal to the product of the first 3 coordinates. We then train a 4-layer feedforward neural network with activation function $\sigma(t) = \max\{0, t\}^2$, mapping the input space $\mathbb{R}^d$ to $\mathbb{R}$ as follows: $\mathbb{R}^d \rightarrow \mathbb{R}^{4d} \rightarrow \mathbb{R}^{4d} \rightarrow \mathbb{R}^{4d} \rightarrow \mathbb{R}$. We start at the standard PyTorch random initialization and run two full-batch algorithms, gradient descent and a spectral descent method (applied only to layers 2 and 3, while layers 1 and 4 take Euclidean gradient steps; see Section \ref{sec:multiple} for a justification of this choice), and observe the stable ranks of the activation matrices. We note that the maximum possible stable rank of such matrices is $256$.
  • Figure 2: Stable rank of MLP post-activations while running a particular snapshot (specifically modded_nanogpt_july18) of the modded-NanoGPT repo \cite{moddednanogpt2025}. We also include the July 18th snapshot of the architecture at the time of training in Section \ref{figure:nanogpt}. We note that the maximal rank of any activation matrix in this plot is $3072$. Thus, the stable ranks of the post-activations are far below their maximal value.
  • Figure 3: Comparison of gradient descent ($\mathtt{GD}$) and spectral gradient descent ($\mathtt{SpecGD}$) on the random feature model $\min_{W}\mathcal{L}(W)=\tfrac{1}{2n}\|WA-Y\|_F^2$ with $W\in \mathbb{R}^{100\times 100}$. The data matrix is generated by $A=\sigma(W_1 X)$, where $W_1\in \mathbb{R}^{100\times 50}$ and $X\in \mathbb{R}^{100\times 400}$ have iid standard Gaussian entries and $\sigma(t)=\max\{0,t\}$ is the ReLU activation function applied coordinatewise. The target matrix $Y=W_{\sharp}A$ is generated from a ground truth matrix $W_{\sharp}\in \mathbb{R}^{100\times 100}$ with iid standard Gaussian entries. Both methods are initialized at the all-zeros matrix. Left: suboptimality gap in function value along the $\mathtt{GD}$ (solid blue) and $\mathtt{SpecGD}$ (solid gold) iterations. The dashed curves plot the suboptimality gap obtained by initializing $\mathtt{SpecGD}$ at the current $\mathtt{GD}$ iterate, plotted every $100$ iterations. The superior performance of $\mathtt{SpecGD}$ is clear from the figure. Right: nuclear rank $\mathrm{nr}(\nabla \mathcal{L}(W))$ of the gradient along the $\mathtt{GD}$ and $\mathtt{SpecGD}$ iterations initialized at the all-zeros matrix; the black dashed line marks the level ${\rm st}(A)$, above which $\mathtt{SpecGD}$ is superior to $\mathtt{GD}$. The nuclear rank $\mathrm{nr}(\nabla \mathcal{L}(W))$ can be large (with \ref{eq:sd-vs-gd-condition} holding) along the trajectories.
  • Figure 4: The training loss and gradient nuclear rank associated with the sparse regression problem in Figure \ref{fig:synthetic_stable_rank_sparse}. We see that the training loss for $\mathtt{SpecGD}$ decreases significantly faster than for $\mathtt{GD}$ during the initial phase of training, when the nuclear rank of the gradient is large. The maximum possible nuclear rank at layer 1 is 128, while for layers 2 and 3, it is 256.
  • Figure 5: Modded NanoGPT MLP gradient nuclear ranks.
  • ...and 14 more figures
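
The random feature comparison described in the Figure 3 caption can be sketched for a single step from the all-zeros initialization. This is a rough illustration, not the authors' code. Two things are assumptions made here: X is given shape 50×400 so that the product W_1 X is defined, and both step sizes are taken from the standard descent lemma for this quadratic loss (1/L_F with L_F = ‖A‖_op²/n for GD, and ‖G‖_*/L_op with L_op = ‖A‖_F²/n for the spectral step). With those choices, the guaranteed decrease of the spectral step exceeds that of the GD step exactly when nr(G) ≥ st(A).

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 100, 50, 400

W1 = rng.standard_normal((k, d))
X = rng.standard_normal((d, n))        # assumed shape so that W1 @ X is defined
A = np.maximum(W1 @ X, 0.0)            # ReLU random features
W_sharp = rng.standard_normal((k, k))  # ground truth matrix
Y = W_sharp @ A

def loss(W):
    """L(W) = (1 / 2n) * ||W A - Y||_F^2."""
    return 0.5 / n * np.linalg.norm(W @ A - Y, "fro") ** 2

def grad(W):
    """Gradient of L at W: (W A - Y) A^T / n."""
    return (W @ A - Y) @ A.T / n

W0 = np.zeros((k, k))
G = grad(W0)
U, s, Vt = np.linalg.svd(G, full_matrices=False)

fro_A_sq = np.linalg.norm(A, "fro") ** 2
op_A_sq = np.linalg.norm(A, 2) ** 2    # squared largest singular value

nr_G = s.sum() ** 2 / (s ** 2).sum()   # nuclear rank of the gradient
st_A = fro_A_sq / op_A_sq              # stable rank of the activations

# Euclidean step with step size 1 / L_F (assumed choice).
W_gd = W0 - (n / op_A_sq) * G
# Spectral step along the polar factor U V^T, scaled by ||G||_* / L_op (assumed choice).
W_spec = W0 - (n * s.sum() / fro_A_sq) * (U @ Vt)

# Decreases guaranteed by the descent lemma for each step.
bound_gd = n * (s ** 2).sum() / (2 * op_A_sq)
bound_spec = n * s.sum() ** 2 / (2 * fro_A_sq)

loss0, loss_gd, loss_spec = loss(W0), loss(W_gd), loss(W_spec)
print(f"nr(G) = {nr_G:.2f}, st(A) = {st_A:.2f}")
print(f"GD decrease:     actual {loss0 - loss_gd:.4e}, guaranteed {bound_gd:.4e}")
print(f"SpecGD decrease: actual {loss0 - loss_spec:.4e}, guaranteed {bound_spec:.4e}")
```

The comparison here concerns guaranteed one-step decreases only; the actual decreases printed alongside can differ from the bounds, and the step sizes used in the paper's experiments may differ from the ones assumed in this sketch.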

Theorems & Definitions (64)

  • Lemma 2.1: Consistency of the stable rank
  • proof
  • Theorem 2.2: Constant probability guarantees
  • proof
  • Theorem 2.3: High probability
  • proof
  • Example 2.1
  • Lemma 2.4: Simple bound
  • Example 2.2: NSR for a single-hidden layer with centered Gaussian data
  • Lemma 2.5: Consistency of the Empirical NSR
  • ...and 54 more