Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization
Wu Lin, Scott C. Lowe, Felix Dangel, Runa Eschenhagen, Zikun Xu, Roger B. Grosse
TL;DR
This work reframes Shampoo and SOAP as covariance-estimation procedures under Kullback-Leibler divergence, revealing limitations of their Kronecker-based estimators when jointly learning factors. It develops KL-Shampoo and KL-SOAP, practical KL-based schemes with QR-based implementations and EMA-based eigenvalue updates that avoid Adam’s memory overhead yet match or exceed competing methods. Empirically, KL-Shampoo consistently outperforms Shampoo, SOAP, and KL-SOAP across diverse neural-network pre-training tasks, including tensor-valued weights, highlighting the advantages of a KL-driven approach. The study provides a unifying framework for designing SPD-preserving, divergence-based preconditioners and extends naturally to tensor-valued settings, offering a principled path for future structured optimization in neural networks.
Abstract
Shampoo and its efficient variant, SOAP, employ structured second-moment estimations and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis; both methods therefore incur additional memory overhead from Adam. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a promising foundation for designing structured methods in NN optimization.
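To make the covariance-estimation view concrete, the sketch below evaluates the KL divergence between two zero-mean Gaussians, one with a dense covariance and one with a Kronecker-structured covariance. It is only an illustration under stated assumptions: the abstract does not specify the direction of the divergence or the exact objective the paper minimizes, so the direction KL(N(0, P) || N(0, Q)) used here, the helper names `gaussian_kl` and `random_spd`, and the random test matrices are all hypothetical choices rather than the authors' method.

```python
import numpy as np


def gaussian_kl(cov_p, cov_q):
    """KL( N(0, cov_p) || N(0, cov_q) ) for symmetric positive-definite inputs.

    Closed form: 0.5 * ( tr(cov_q^{-1} cov_p) - d + log det cov_q - log det cov_p ).
    """
    d = cov_p.shape[0]
    # slogdet returns (sign, log|det|); the sign is +1 for SPD matrices.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    # Solve a linear system instead of forming the explicit inverse of cov_q.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    return 0.5 * (trace_term - d + logdet_q - logdet_p)


def random_spd(rng, n):
    """Hypothetical helper: a well-conditioned random SPD matrix of size n x n."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)


rng = np.random.default_rng(0)
m, n = 3, 4

# Assumed stand-in for a dense second-moment matrix of vectorized m x n gradients.
target = random_spd(rng, m * n)

# Kronecker-structured candidate built from two small SPD factors.
factor_a = random_spd(rng, m)
factor_b = random_spd(rng, n)
kron_cov = np.kron(factor_a, factor_b)

kl_dense_vs_kron = gaussian_kl(target, kron_cov)
kl_self = gaussian_kl(target, target)

# Rescaling the factors as (c * A, B / c) leaves the Kronecker product unchanged,
# so the divergence between the two parameterizations is zero.
kl_rescaled = gaussian_kl(kron_cov, np.kron(2.0 * factor_a, factor_b / 2.0))
```

The divergence is zero only when the two covariances coincide, so `kl_dense_vs_kron` measures how far a given pair of Kronecker factors is from the dense target. The rescaling check shows that the factors are identifiable only up to a scalar, which any scheme that learns both factors jointly has to account for.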
