Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner
Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi
TL;DR
This work interrogates Shampoo’s heuristics by decoupling the updates of its preconditioner’s eigenvalues from those of its eigenbasis and studying eigenvalue-corrected variants (EShampoo and SOAP). It shows that learning rate grafting from Adam compensates for stale and mis-scaled eigenvalues, and that correcting the eigenvalues directly removes the need for grafting, yielding updates closer to full-matrix Adam. To control the approximation error from stale eigenbases, the authors propose an adaptive criterion, motivated by terminating a warm-started QR algorithm, that decides separately for each Kronecker factor when to recompute its eigenbasis, improving efficiency while preserving convergence behavior. Empirically, adaptive eigenbasis updates and eigenvalue corrections perform robustly across diverse workloads (e.g., FastMRI, ImageNet ViT, OGBG) and model families, and the resulting update patterns vary by parameter type and model. The results offer a principled direction for removing Shampoo’s heuristics and for developing improved Kronecker-factorization-based training algorithms, while leaving open theoretical questions on regret bounds and scalability to very large models.
Abstract
The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics, such as learning rate grafting and stale preconditioning, to achieve performance at scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalue and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and that correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency, motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.
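
To make the two ideas in the abstract concrete, the following NumPy sketch shows one way they could fit together for a single matrix-shaped parameter. It is an illustration under stated assumptions, not the authors' implementation: the function names (`offdiag_ratio`, `refresh_basis`, `init_state`, `eshampoo_step`), the relative off-diagonal test used as the termination criterion, the tolerance `tol`, and the omission of bias correction and momentum are all choices made here for brevity.

```python
import numpy as np

def offdiag_ratio(factor, basis):
    """Relative off-diagonal mass of basis^T @ factor @ basis.
    Close to 0 when the columns of `basis` are eigenvectors of `factor`."""
    rotated = basis.T @ factor @ basis
    off = rotated - np.diag(np.diag(rotated))
    return np.linalg.norm(off) / max(np.linalg.norm(rotated), 1e-30)

def refresh_basis(factor, basis, tol=0.1, max_iters=10):
    """Warm-started QR (orthogonal) iteration, started from the previous basis.
    Assumed criterion: stop as soon as the off-diagonal ratio is at most `tol`,
    so a basis that is still accurate costs zero QR steps."""
    iters = 0
    while offdiag_ratio(factor, basis) > tol and iters < max_iters:
        basis, _ = np.linalg.qr(factor @ basis)
        iters += 1
    return basis, iters

def init_state(m, n):
    """Hypothetical optimizer state for an m-by-n parameter."""
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "QL": np.eye(m), "QR": np.eye(n), "D": np.zeros((m, n))}

def eshampoo_step(grad, state, beta2=0.999, eps=1e-8, tol=0.1):
    """One eigenvalue-corrected update direction (no bias correction, no momentum)."""
    # Kronecker factor statistics, as in Shampoo.
    state["L"] = beta2 * state["L"] + (1 - beta2) * grad @ grad.T
    state["R"] = beta2 * state["R"] + (1 - beta2) * grad.T @ grad
    # Each factor decides independently whether its eigenbasis needs work.
    state["QL"], _ = refresh_basis(state["L"], state["QL"], tol)
    state["QR"], _ = refresh_basis(state["R"], state["QR"], tol)
    # Eigenvalue correction: track second moments of the rotated gradient at
    # every step instead of reusing eigenvalues from the stale factorization.
    rotated = state["QL"].T @ grad @ state["QR"]
    state["D"] = beta2 * state["D"] + (1 - beta2) * rotated**2
    # Adam-style scaling in the eigenbasis, then rotate back.
    return state["QL"] @ (rotated / (np.sqrt(state["D"]) + eps)) @ state["QR"].T
```

In this sketch the eigenvalue estimate `D` is refreshed at every step, while each eigenbasis is only touched when its own factor fails the tolerance test, which mirrors the decoupling of eigenvalue and eigenbasis updates described above.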
