Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization

Wu Lin, Scott C. Lowe, Felix Dangel, Runa Eschenhagen, Zikun Xu, Roger B. Grosse

TL;DR

This work reframes Shampoo and SOAP as covariance-estimation procedures under Kullback-Leibler divergence, revealing limitations of their Kronecker-based estimators when jointly learning factors. It develops KL-Shampoo and KL-SOAP, practical KL-based schemes with QR-based implementations and EMA-based eigenvalue updates that avoid Adam’s memory overhead yet match or exceed competing methods. Empirically, KL-Shampoo consistently outperforms Shampoo, SOAP, and KL-SOAP across diverse neural-network pre-training tasks, including tensor-valued weights, highlighting the advantages of a KL-driven approach. The study provides a unifying framework for designing SPD-preserving, divergence-based preconditioners and extends naturally to tensor-valued settings, offering a principled path for future structured optimization in neural networks.

Abstract

Shampoo and its efficient variant, SOAP, employ structured second-moment estimations and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis -- at the cost of additional memory overhead from Adam in both methods. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop KL-Shampoo and KL-SOAP, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a promising foundation for designing structured methods in NN optimization.
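
To make the KL framing concrete, the snippet below is a minimal sketch of the closed-form KL divergence between two zero-mean Gaussians, evaluated for a Kronecker-structured candidate covariance. The helper names (`gaussian_kl`, `random_spd`), the dimensions, and the random matrices are illustrative assumptions. The abstract does not specify the direction of the divergence or the resulting update rules, so this should not be read as the authors' objective.

```python
import numpy as np


def gaussian_kl(cov_p, cov_q):
    """KL( N(0, cov_p) || N(0, cov_q) ) for symmetric positive-definite inputs."""
    d = cov_p.shape[0]
    # slogdet returns (sign, log|det|); the sign is +1 for SPD matrices.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    # tr(cov_q^{-1} cov_p), computed with a linear solve instead of an inverse.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    return 0.5 * (trace_term - d + logdet_q - logdet_p)


rng = np.random.default_rng(0)


def random_spd(n):
    """Hypothetical helper: a random, well-conditioned SPD matrix."""
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)


# Illustrative sizes: a 3x3 and a 2x2 Kronecker factor give a 6x6 covariance.
A = random_spd(3)
B = random_spd(2)
target = random_spd(6)           # stand-in for a full second-moment matrix
candidate = np.kron(A, B)        # Kronecker-structured approximation

kl_value = gaussian_kl(target, candidate)
print(f"KL(target || A kron B) = {kl_value:.4f}")
```

The divergence is zero only when the two covariances coincide, and it is defined only for positive-definite arguments, which is one reason a KL-based criterion behaves differently from a Frobenius-norm distance.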

Paper Structure

This paper contains 38 sections, 41 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Empirical results (random search using 150 runs for each method) on language models demonstrate the advantages of KL-based methods over Shampoo and SOAP while matching SOAP's per-iteration runtime. All methods take the same number of iterations in these experiments. Surprisingly, KL-Shampoo also outperforms KL-SOAP. We include the best Shampoo run based on a state-of-the-art implementation from Meta (Shi et al., 2023) in the plots. See Figure \ref{fig:larger} in Appendix \ref{app:extra_exp} for an evaluation of KL-Shampoo on a larger model (Llama3 with 450M parameters).
  • Figure 2: Empirical results (random search using 150 runs for each method) on language models demonstrate that KL-Shampoo does not rely on step-size grafting with Adam to perform well. Shampoo without grafting does not perform well, even when using the state-of-the-art implementation (Shi et al., 2023). In particular, Shampoo with power $p=1/2$ fails to train the RWKV7 model in all 150 runs when grafting is disabled.
  • Figure 3: Left: Simplified Shampoo-based schemes without momentum, damping, and weight decay. See Figure \ref{fig:ema_shampoo_comparison} for an empirical comparison and Box \ref{box:pkl-shampoo} for the practical KL-Shampoo. Top Right: For computational efficiency, we replace the eigen step with our exponential moving average (EMA) scheme for estimating eigenvalues, combined with infrequent eigenbasis estimation using QR: we estimate the eigenvalues ${\bm{\lambda}}_k$ using an outdated eigenbasis ${\bm{Q}}_k$ for $k \in \{a,b\}$, and use the QR procedure to estimate ${\bm{Q}}_k$. Bottom Right: Simplified SOAP-based schemes without momentum. Notably, KL-SOAP needs an estimate of ${\bm{\lambda}}_k$ in Step 3a to compute the eigenbasis ${\bm{Q}}$, whereas SOAP does not. Here, we view RMSProp's second moment in the eigenbasis as augmented eigenvalues, highlighted in blue.
  • Figure 4: Empirical results (random search using 150 runs for each method) demonstrate that our EMA scheme for eigenvalue estimation makes KL-Shampoo competitive when using an outdated eigenbasis. Without this scheme, KL-Shampoo performs poorly under an outdated eigenbasis ${\bm{Q}}_k$, even when employing the instantaneous eigenvalue estimation ${\bm{\lambda}}_k^\text{(inst)}=\mathrm{diag}({\bm{Q}}_k^\top {\bm{S}}_k {\bm{Q}}_k)$ for $k \in \{a,b\}$ at every iteration, as suggested by Eschenhagen et al. (2025). Adopting the EMA scheme also makes other variants of Shampoo competitive (Figures \ref{fig:frob_shampoo_ema} and \ref{fig:trace_shampoo_ema}, Appendix \ref{app:extra_exp}) and allows the trace-scaling variant to outperform SOAP (Figure \ref{fig:trace_shampoo}, Appendix \ref{app:extra_exp}).
  • Figure 5: Empirical results (random search using 150 runs for each method) demonstrate the advantages of KL-Shampoo's (two-sided) estimation over other Shampoo variants under comparable settings for NN training, including Shampoo with $p=1/2$ (no grafting, Eq. \ref{eq:shampoo}), F-Shampoo (two-sided, Frobenius-norm-based, Figure \ref{fig:f_shampoo}), and VN-Shampoo (trace scaling, two-sided, von-Neumann-divergence-based, Figure \ref{fig:vn_shampoo}). We make these variants practical by incorporating a QR step and an EMA scheme for eigenvalue estimation (Box \ref{box:qr_for_kl}). To ensure a fair comparison and minimize implementation bias, we implement Shampoo, F-Shampoo, and VN-Shampoo ourselves, aligning them closely with KL-Shampoo. See Figure \ref{fig:trace_shampoo} (Appendix \ref{app:extra_exp}) for a detailed comparison between KL-Shampoo and VN-Shampoo.
  • ...and 6 more figures
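
The eigenvalue estimates described in the captions of Figures 3 and 4 can be sketched in NumPy as follows. The instantaneous estimate follows the caption's formula $\mathrm{diag}({\bm{Q}}_k^\top {\bm{S}}_k {\bm{Q}}_k)$. The form of the EMA update, the decay `beta`, and the single QR step `qr(S @ Q)` used to refresh the basis are assumptions made for illustration, since this summary does not give the paper's exact update rules.

```python
import numpy as np


def instantaneous_eigenvalues(Q, S):
    """diag(Q^T S Q): eigenvalue estimate of S in a (possibly outdated) basis Q."""
    # Entry j is sum_{i,k} Q[i, j] * S[i, k] * Q[k, j]; avoids forming Q^T S Q.
    return np.einsum("ij,ik,kj->j", Q, S, Q)


def ema_eigenvalues(lam, Q, S, beta=0.95):
    """Assumed EMA form: blend the old estimate with the instantaneous one."""
    return beta * lam + (1.0 - beta) * instantaneous_eigenvalues(Q, S)


def qr_refresh(Q, S):
    """Assumed basis refresh: one orthogonal-iteration step, re-orthonormalized by QR."""
    Q_new, _ = np.linalg.qr(S @ Q)
    return Q_new


rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = M @ M.T + 4.0 * np.eye(4)            # illustrative SPD statistic

# With the exact eigenbasis, the instantaneous estimate recovers the eigenvalues.
true_eigs, Q_exact = np.linalg.eigh(S)
lam_exact = instantaneous_eigenvalues(Q_exact, S)

# With a stale (here: random orthonormal) basis, the estimates stay positive.
Q_stale, _ = np.linalg.qr(rng.standard_normal((4, 4)))
lam = instantaneous_eigenvalues(Q_stale, S)
for _ in range(3):
    lam = ema_eigenvalues(lam, Q_stale, S)

# Infrequent basis update: the result is again orthonormal.
Q_new = qr_refresh(Q_stale, S)
```

In this sketch the eigenvalue estimate is cheap enough to update at every iteration, while `qr_refresh` would be called only occasionally, mirroring the "outdated eigenbasis" setting the captions describe.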

Theorems & Definitions (7)

  • Claim 1
  • Claim 2
  • Claim 3
  • Claim 4
  • Claim 5
  • Claim 6
  • Claim 7