Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

TL;DR

This work examines Shampoo’s heuristics by decoupling the eigenvalues of its preconditioner from the eigenbasis and studying eigenvalue-corrected variants (EShampoo and SOAP). It shows that learning rate grafting compensates for stale and mis-scaled eigenvalues, and that correcting the eigenvalues directly removes the need for grafting while bringing the update closer to full-matrix Adam. To control the approximation error from stale eigenbases, the authors propose an adaptive criterion, motivated by terminating a warm-started QR algorithm, that decides per Kronecker factor when to recompute the eigenbasis; in one reported setting it improves wall-clock efficiency by 20% over recomputing every 100 iterations. Empirically, adaptive eigenbasis updates and eigenvalue corrections perform robustly across diverse workloads (e.g., FastMRI, ImageNet ViT, OGBG) and model families, with patterns of adaptivity varying by parameter type and model. The results offer a principled direction for removing Shampoo’s heuristics and developing improved Kronecker-factorization-based training algorithms, while leaving open theoretical questions on regret bounds and scalability to very large models.

Abstract

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and that correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
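
The idea of correcting the eigenvalues directly can be illustrated with a minimal NumPy sketch. It assumes the correction is an Adam-style running average of squared gradients taken in the rotated coordinates of a (possibly stale) eigenbasis. The function name `eigenvalue_corrected_step`, the hyperparameters `beta2` and `eps`, and the omission of momentum and bias correction are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def eigenvalue_corrected_step(G, Q_L, Q_R, D, beta2=0.999, eps=1e-8):
    """One hypothetical eigenvalue-corrected step.

    G        : gradient of an m x n weight matrix
    Q_L, Q_R : orthogonal eigenbases of the two Kronecker factors (may be stale)
    D        : running second-moment estimate in the rotated coordinates
    Returns the update direction and the refreshed D.
    """
    # Rotate the gradient into the current eigenbasis.
    G_rot = Q_L.T @ G @ Q_R
    # Refresh the eigenvalue estimate every step, even if Q_L and Q_R are stale.
    D = beta2 * D + (1.0 - beta2) * G_rot**2
    # Scale elementwise in the eigenbasis, then rotate back.
    U = Q_L @ (G_rot / (np.sqrt(D) + eps)) @ Q_R.T
    return U, D

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

# With identity eigenbases the step reduces to a diagonal, Adam-style scaling.
U, D = eigenvalue_corrected_step(G, np.eye(4), np.eye(3), np.zeros((4, 3)))
adam_like = G / (np.sqrt((1.0 - 0.999) * G**2) + 1e-8)
```

Because `D` is refreshed at every step, its scale tracks the current gradients even when the eigenbases are recomputed only occasionally, which is the role the abstract attributes to grafting in standard Shampoo.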

Paper Structure

This paper contains 21 sections, 6 theorems, 23 equations, 11 figures, 3 tables, and 4 algorithms.

Key Result

Lemma 1

Let ${\bm{U}} = {\bm{Q}}_{{\bm{L}}} ({\bm{D}}^{\odot-p} \odot ({\bm{Q}}_{{\bm{L}}}^{\intercal} {\bm{G}} {\bm{Q}}_{{\bm{R}}})) {\bm{Q}}_{{\bm{R}}}^{\intercal} \in \mathbb{R}^{m \times n}$ be the generalized eigendecomposed Kronecker-factored update given by orthogonal matric
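
The update formula in Lemma 1 can be written out directly in NumPy. The sketch below evaluates $U$ for given $Q_L$, $Q_R$, $D$, and $p$, and checks one special case that follows from linear algebra alone: choosing $D$ as the outer product of the eigenvalues of two Kronecker factors with $p = 1/4$ reproduces $L^{-1/4} G R^{-1/4}$. The factor definitions ($L = GG^\intercal + \epsilon I$, $R = G^\intercal G + \epsilon I$ from a single gradient) and the helper names are assumptions chosen for a small self-contained example.

```python
import numpy as np

def eigen_kron_update(G, Q_L, Q_R, D, p):
    """U = Q_L (D^{-p} (elementwise) * (Q_L^T G Q_R)) Q_R^T."""
    G_rot = Q_L.T @ G @ Q_R          # gradient in the eigenbasis
    return Q_L @ (D ** (-p) * G_rot) @ Q_R.T

def inv_root(A, p):
    """A^{-p} for a symmetric positive definite matrix A."""
    lam, Q = np.linalg.eigh(A)
    return (Q * lam ** (-p)) @ Q.T   # scales column j of Q by lam[j]^{-p}

rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.standard_normal((m, n))

# Illustrative Kronecker factors built from a single gradient.
L = G @ G.T + 1e-3 * np.eye(m)
R = G.T @ G + 1e-3 * np.eye(n)
lam_L, Q_L = np.linalg.eigh(L)
lam_R, Q_R = np.linalg.eigh(R)

# D_ij = lam_L[i] * lam_R[j] with p = 1/4 gives L^{-1/4} G R^{-1/4}.
U_eig = eigen_kron_update(G, Q_L, Q_R, np.outer(lam_L, lam_R), p=0.25)
U_direct = inv_root(L, 0.25) @ G @ inv_root(R, 0.25)
```

Writing the update this way separates the two ingredients: the eigenbases $Q_L$, $Q_R$ and the matrix of eigenvalues $D$. Any other positive $D$ of shape $m \times n$ can be substituted without touching the eigenbases.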

Figures (11)

  • Figure 1: Shampoo with stale preconditioner (updating the root inverse matrices every $F=100$ steps) without grafting for different choices of the learning rate $\alpha$ and $\epsilon$ on Imagewoof. All tested hyperparameter combinations underperform AdamW and, by extension, Shampoo with Adam grafting.
  • Figure 2: Training results with different Shampoo variants and eigendecomposition frequencies $F$ on the Imagewoof dataset. Shampoo with eigenvalue correction achieves a better training loss compared to Shampoo with Adam grafting, and the optimal learning rate for Adam transfers to both variants.
  • Figure 3: All configurations are for EShampoo. (left) On the Imagewoof ViT problem, setting the maximum number of iterations $I < 10$ with threshold $\tau = 0.2$ for the adaptive QR algorithm leads to a significant increase in wall-clock time compared to using adaptive eigh. Even with $I = 10$, adaptive eigh is faster. The default SOAP setting achieves worse final loss and is also slightly slower. (right) Using the adaptive criterion to determine when to skip the eigendecomposition (eigh) improves efficiency by 20% in wall-clock time compared to updating every 100 iterations (AlgoPerf setting).
  • Figure 4: We show the mean with standard error across preconditioners corresponding to the labels in the legends, for EShampoo with $\tau=0.01$ and $F=1$ on Imagewoof ViT. The eigenbases for biases and layer normalization parameters are changing faster than for weight matrices and linear layers, respectively.
  • Figure 5: All configurations are for EShampoo. (left) The error in the eigenbases is dramatically more important for early iterations. A single eigenbasis computation at the first iteration ($\tau=0.99$) is sufficient to outperform AdamW. (right) The difference between the convergence behavior of AdamW and EShampoo on this problem can be exclusively attributed to the eigenbases corresponding to 2D parameters.
  • ...and 6 more figures
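
The threshold $\tau$ that appears in the captions above can be made concrete with a small sketch of an adaptive eigenbasis update. The exact criterion is not stated in this summary, so the code assumes one plausible form: the Frobenius norm of the off-diagonal part of $Q^\intercal A Q$ relative to the norm of the whole matrix, with a recomputation via `eigh` when that ratio exceeds $\tau$. The names `offdiag_ratio` and `maybe_update_eigenbasis` are hypothetical.

```python
import numpy as np

def offdiag_ratio(A, Q):
    """Relative off-diagonal mass of A expressed in the basis Q (assumed criterion)."""
    A_rot = Q.T @ A @ Q
    off = A_rot - np.diag(np.diag(A_rot))
    return float(np.linalg.norm(off) / np.linalg.norm(A_rot))

def maybe_update_eigenbasis(A, Q, tau):
    """Recompute the eigenbasis of one Kronecker factor only if Q is too stale."""
    if offdiag_ratio(A, Q) > tau:
        _, Q = np.linalg.eigh(A)
        return Q, True
    return Q, False

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # a symmetric factor matrix
Q_old = np.eye(2)                         # a stale basis that does not diagonalize A

ratio_old = offdiag_ratio(A, Q_old)
Q_new, recomputed = maybe_update_eigenbasis(A, Q_old, tau=0.2)
_, recomputed_again = maybe_update_eigenbasis(A, Q_new, tau=0.2)
```

Under this assumed form the ratio is at most 1, so a threshold close to 1 almost never triggers a recomputation, while a small threshold such as $\tau = 0.01$ triggers one whenever the basis drifts slightly. Calling the check separately for each Kronecker factor lets their update frequencies differ.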

Theorems & Definitions (10)

  • Lemma 1
  • Proposition 1
  • Lemma 1
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Corollary 1
  • proof