Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

TL;DR

This work investigates Shampoo and Muon as data-efficient matrix-based optimizers, revealing that Shampoo structurally equals a Muon update with left and right adaptations, analogous to Adam and Signum. Through language-model experiments, Shampoo variants deliver superior token efficiency over Muon and often over AdamW, with KL-Shampoo and Shampoo$^{1/2}$ frequently performing best; Shampoo benefits largely arise when applied to weight matrices, not to 1D or embedding parameters. The authors introduce time-averaged orthogonality in expectation to unify adaptation to stochasticity and the parameter trajectory with spectral descent, and show instantaneous KL-Shampoo converges to spectral descent, strengthening the link between matrix preconditioning and strict orthogonality constraints. Localizing Shampoo’s advantages, the paper demonstrates the critical role of two-sided preconditioning, the dependence on reshaping for matrix parameters, and the nontrivial influence of hyperparameters such as $\beta_1$, $\beta_2$, and $\epsilon$. Overall, the results offer mechanistic insight into adaptive matrix optimizers, guiding practical usage and pointing to future work on scalable, stable deployment and broader applicability.

Abstract

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam's advantage over Signum. We show that Shampoo's update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo's benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo's updates are time-averaged semi-orthogonal in expectation.
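The reduction mentioned in the abstract can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes a square, full-rank gradient, Shampoo factors $L = GG^\top$ and $R = G^\top G$ built from that single gradient with exponent $-1/4$, and Adam without EMAs or $\epsilon$. The helper names `inv_root_psd` and `matrix_sign` are hypothetical.

```python
import numpy as np


def inv_root_psd(a, p):
    """Return a^{-p} for a symmetric positive-definite matrix (hypothetical helper)."""
    w, q = np.linalg.eigh(a)
    return (q * w ** (-p)) @ q.T


def matrix_sign(g):
    """Semi-orthogonal polar factor U V^T from the thin SVD g = U S V^T."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)

# Build a well-conditioned 4x4 "gradient" with known singular values (assumption:
# square and full rank, so both Shampoo factors are invertible).
q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
g = q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ q2.T

# Adam with no averaging and no epsilon: g / sqrt(g^2) is sign descent.
adam_step = g / np.sqrt(g ** 2)

# Shampoo with no accumulation: (G G^T)^{-1/4} G (G^T G)^{-1/4} is spectral descent.
shampoo_step = inv_root_psd(g @ g.T, 0.25) @ g @ inv_root_psd(g.T @ g, 0.25)

print(np.allclose(adam_step, np.sign(g)))
print(np.allclose(shampoo_step, matrix_sign(g)))
```

Once the preconditioners are accumulated over several stochastic gradients, neither identity holds exactly, which is the regime the paper studies.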

Paper Structure (63 sections, 7 theorems, 72 equations, 6 figures, 14 tables, 1 algorithm)

Key Result

Proposition 4.1

Let ${\bm{g}} \in \mathbb{R}^d$ be a random variable with $\mathbb{E}[{\bm{g}}] = \nabla {\mathcal{L}}$ and $\mathrm{Var}({\bm{g}}) = \sigma^2$. Then $\mathbb{E}[|| \bm{\gamma} \odot {\bm{g}} - \nabla {\mathcal{L}}||_2^2]$ is minimized by and $\mathbb{E}[|| \bm{\gamma} \odot \mathrm{sign}({\bm{g}}) - \mathrm{sign}(\nabla {\mathcal{L}})||_2^2]$ is minimized by
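The two minimizers are missing from the extracted statement above. As a hedged sketch only, the following per-coordinate calculation shows what a standard least-squares argument yields; the paper's exact expressions and notation may differ. It assumes $\sigma_i^2$ denotes the variance of coordinate $g_i$, that $g_i \neq 0$ almost surely and $\nabla_i \mathcal{L} \neq 0$, and it introduces the symbol $\rho_i$ (not taken from the source) for the probability that the stochastic and true gradient signs agree.

```latex
% Both objectives separate over coordinates i, so each gamma_i is optimized independently.
\begin{align*}
\mathbb{E}\big[(\gamma_i g_i - \nabla_i \mathcal{L})^2\big]
  &= \gamma_i^2 \big( (\nabla_i \mathcal{L})^2 + \sigma_i^2 \big)
     - 2 \gamma_i (\nabla_i \mathcal{L})^2 + (\nabla_i \mathcal{L})^2
  &&\Longrightarrow\quad
  \gamma_i^\star = \frac{(\nabla_i \mathcal{L})^2}{(\nabla_i \mathcal{L})^2 + \sigma_i^2}, \\
\mathbb{E}\big[(\gamma_i \,\mathrm{sign}(g_i) - \mathrm{sign}(\nabla_i \mathcal{L}))^2\big]
  &= \gamma_i^2 - 2 \gamma_i (2\rho_i - 1) + 1
  &&\Longrightarrow\quad
  \gamma_i^\star = 2\rho_i - 1,
\end{align*}
% where rho_i := P[ sign(g_i) = sign(grad_i L) ], so that
% E[ sign(g_i) sign(grad_i L) ] = rho_i - (1 - rho_i) = 2 rho_i - 1.
```

In both cases the factor shrinks toward zero as the gradient noise grows relative to the signal.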

Figures (6)

  • Figure 1: Shampoo : Muon :: Adam : Signum. Adam has previously been interpreted as element-wise scaled Signum (left, top), which uses the sign of the gradient's EMA [balles2017dissecting; orvieto2025search]. We show that Shampoo can analogously be understood as Muon, the matrix sign of the gradient's EMA, left- and right-multiplied by matrices (right, top). Just like for Adam (left, bottom), the adaptation in Shampoo results in improved token efficiency compared to Muon in a controlled language modeling setting (right, bottom). Different variants of the algorithms are analogous in both geometries (middle); see Table \ref{tab:optimizer-variants-betas} for the precise relationships.
  • Figure 2: Shampoo$^{1/2}$ ($p=1/8$ for 4D) with different reshaping strategies. Reshaping to 2D matrices outperforms reshaping to 1D vectors (full-matrix Adam) or preserving 4D tensor structure.
  • Figure 3: Full-batch setting. RMSProp and Shampoo$^{1/2}$ only adapt to the parameter trajectory through the EMA in their preconditioner, but outperform SignGD and SpectralGD, respectively.
  • Figure 4: Final validation perplexity (single runs) across search space of learning rate and $\beta$s of AdamW in the 320M, $1\times$ token budget, batch size 256 and 64 settings. See Table \ref{tab:hparams-space} for search space ranges. Note that underperforming runs may not be shown.
  • Figure 5: Final validation perplexity (single runs) across search space of learning rate and $\beta_1$ of Signum and Muon in the 320M, $1\times$ token budget, batch size 256 setting. See Table \ref{tab:hparams-space} for search space ranges. Note that underperforming runs may not be shown.
  • ...and 1 more figure
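The analogy in the Figure 1 caption can be made concrete with a small numerical sketch. It is an illustration under stated assumptions rather than the paper's algorithm: the exponent $-1/4$, the EMA constants, the $\epsilon$ damping, the absence of bias correction, and the particular way the left and right factors are split are all choices made here, and the matrix sign is computed with an exact SVD where practical Muon implementations use an iterative approximation. All variable names are hypothetical.

```python
import numpy as np


def psd_power(a, p):
    """a^p for a symmetric positive-definite matrix (hypothetical helper)."""
    w, q = np.linalg.eigh(a)
    return (q * w ** p) @ q.T


rng = np.random.default_rng(0)
rows, cols = 3, 5
beta1, beta2, eps = 0.9, 0.99, 1e-8  # assumed hyperparameters

mom = np.zeros((rows, cols))   # EMA of gradients
v = np.zeros((rows, cols))     # Adam second moment (element-wise)
L = np.zeros((rows, rows))     # Shampoo left factor, EMA of G G^T
R = np.zeros((cols, cols))     # Shampoo right factor, EMA of G^T G
for _ in range(20):
    g = rng.standard_normal((rows, cols))  # stand-in stochastic gradient
    mom = beta1 * mom + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    L = beta2 * L + (1 - beta2) * g @ g.T
    R = beta2 * R + (1 - beta2) * g.T @ g

# Adam = Signum scaled element-wise.
adam = mom / (np.sqrt(v) + eps)
signum = np.sign(mom)
scale = np.abs(mom) / (np.sqrt(v) + eps)

# Shampoo = Muon (matrix sign of the momentum) multiplied by a left and a right matrix.
L_inv = psd_power(L + eps * np.eye(rows), -0.25)
R_inv = psd_power(R + eps * np.eye(cols), -0.25)
shampoo = L_inv @ mom @ R_inv

u, s, vt = np.linalg.svd(mom, full_matrices=False)
muon = u @ vt                                  # U V^T
left = L_inv @ (u * np.sqrt(s)) @ u.T          # L^{-1/4} (M M^T)^{1/4}
right = (vt.T * np.sqrt(s)) @ vt @ R_inv       # (M^T M)^{1/4} R^{-1/4}

print(np.allclose(adam, scale * signum))
print(np.allclose(shampoo, left @ muon @ right))
```

The split works because $M = (MM^\top)^{1/4}\, UV^\top (M^\top M)^{1/4}$ for any $M$ with thin SVD $U S V^\top$, so the left and right factors compare the momentum's own spectrum with the accumulated preconditioners.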

Theorems & Definitions (17)

  • Proposition 4.1 (Lemma 1 in balles2017dissecting)
  • Proposition 4.2
  • Proposition 4.3
  • Definition 4.4
  • Corollary 4.5
  • Definition 4.6 (Time-averaged orthogonality in expectation)
  • Corollary 4.7 (Idealized KL-Shampoo)
  • Proposition 4.8 ("Instantaneous" KL-Shampoo converges to spectral descent)
  • proof
  • proof
  • ...and 7 more