What Really Matters in Matrix-Whitening Optimizers?

Kevin Frans, Pieter Abbeel, Sergey Levine

TL;DR

This work investigates why matrix-whitening optimizers outperform elementwise methods by decomposing their core components. Through a controlled experimental framework across multiple optimizer families (e.g., Shampoo, SOAP, Muon) and a GPT-2–style Transformer setup, it shows that gains arise from both spectral normalization and variance adaptation, with variance adaptation being the crucial, often overlooked ingredient. It also demonstrates that lookahead strategies do not reliably replace variance adaptation, and low-rank variance estimators can substantially reduce memory without sacrificing performance. The findings advocate a modular optimizer design where spectral normalization and variance adaptation are decoupled, enabling more efficient and scalable training methods for large neural networks.

Abstract

A range of recent optimizers have emerged that approximate the same "matrix-whitening" transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. With hyperparameters tuned across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts such as Adam. Matrix-whitening is often related to spectral descent; however, experiments reveal that performance gains are *not explained solely by accurate spectral normalization*. In particular, SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and that the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers consistently outperform their sign-descent counterparts, including an adaptive version of Muon. We further ablate variance adaptation strategies, finding that while lookahead-style approximations are not as effective, low-rank variance estimators can effectively reduce memory costs without a performance loss.
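
To make the contrast between sign descent and variance adaptation concrete, here is a minimal NumPy sketch of the elementwise case. It is an illustration under stated assumptions, not the paper's implementation: it assumes Adam-style exponential moving averages, and the function names, the toy gradient stream, and the $\beta_1$, $\beta_2$ values are all hypothetical. It relies on the identity $m / \sqrt{v} = \mathrm{sign}(m) \cdot |m| / \sqrt{v}$: the variance-adapted step is the sign step rescaled by a signal-to-noise factor.

```python
import numpy as np

def ema_moments(grads, beta1=0.9, beta2=0.99):
    """Bias-corrected EMAs of the gradient (m) and squared gradient (v)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    steps = len(grads)
    return m / (1 - beta1**steps), v / (1 - beta2**steps)

def sign_step(m_hat):
    """Sign descent: every coordinate moves by the same magnitude."""
    return np.sign(m_hat)

def variance_adapted_step(m_hat, v_hat, eps=1e-8):
    """Adam-style step: sign(m) scaled by the ratio |m| / sqrt(v)."""
    return m_hat / (np.sqrt(v_hat) + eps)

# Toy gradient stream (an assumption for illustration):
#   coordinate 0 is perfectly consistent (always +1),
#   coordinate 1 flips sign every step around a small mean of 0.1.
grads = [np.array([1.0, (-1.0) ** t + 0.1]) for t in range(1, 201)]

m_hat, v_hat = ema_moments(grads)
signed = sign_step(m_hat)
adapted = variance_adapted_step(m_hat, v_hat)

print("sign step:            ", signed)
print("variance-adapted step:", adapted)
```

In this toy setup the sign step has unit magnitude on both coordinates, while the variance-adapted step stays near 1 on the consistent coordinate and shrinks on the noisy one.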

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Our experimental setup aims to isolate the core effects of various matrix-whitening optimizers on Transformer training. For each method, we sweep over learning rate, weight decay, $\beta_1$, and $\beta_2$. All runs use the same initial parameters and data ordering. Nonstandard parameters (embed, output, and layernorm) are optimized using Adam with fixed tuned hyperparameters.
  • Figure 2: All methods are tuned to within a local optimum of four key hyperparameters. Matrix-whitening optimizers generally maintain their relative performance gains across local adjustments to hyperparameters. Plots are centered around each method's optimal hyperparameters.
  • Figure 3: Left: Muon descends under the spectral norm more accurately than SOAP or SPlus. Exact spectral descent is achieved when all singular values of the update are $\pm 1$, in which case the ratio between the maximum and average singular value is close to 1. In contrast, the Shampoo-style methods achieve this only loosely, with a ratio between $2$ and $3$. Adam results in a ratio of $\approx 12$ (not plotted). Right: Even with increased computation, neither Muon nor SPlus reaches the empirical performance of SOAP. For Muon, we increase the number of Newton-Schulz iterations at each step. For SPlus, we increase the frequency of updating the eigenbasis. The red dotted line represents the performance of SOAP-100.
  • Figure 4: Variance-adapted variants of optimizers outperform their strictly signed-descent counterparts. As detailed in the ablations table (\ref{table:ablations}), these improvements remain when the variance buffer is factorized into a rank-1 approximation, and, to a lesser degree, when $\beta_1=\beta_2$. Variance adaptation can be interpreted as imposing a signal-to-noise dependent adaptive trust region, composable with the rotational or spectral-normalizing aspects of matrix-whitening.
  • Figure 5: SOAP-100, with matrices preconditioned using only one side.
  • ...and 1 more figures
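
The ratio described in the Figure 3 caption (maximum over average singular value of an update) can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' measurement code: the function names and the random test matrix are assumptions, and the orthogonalization uses the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$, not whatever tuned variant a particular Muon implementation may use.

```python
import numpy as np

def max_to_mean_singular_ratio(update):
    """Ratio of the largest to the average singular value of a matrix."""
    s = np.linalg.svd(update, compute_uv=False)
    return s.max() / s.mean()

def newton_schulz_orthogonalize(g, num_iters=30):
    """Approximate the orthogonal polar factor of g.

    Uses the classical cubic Newton-Schulz iteration. Dividing by the
    Frobenius norm first puts every singular value in (0, 1], which is
    inside the iteration's convergence region.
    """
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(num_iters):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x

# Hypothetical stand-in for a gradient matrix of one weight layer.
rng = np.random.default_rng(0)
grad = rng.standard_normal((32, 8))

ratio_raw = max_to_mean_singular_ratio(grad)
ratio_orth = max_to_mean_singular_ratio(newton_schulz_orthogonalize(grad))

print(f"raw gradient ratio:      {ratio_raw:.3f}")
print(f"orthogonalized ratio:    {ratio_orth:.3f}")
```

A ratio near 1 means every singular value of the update has roughly the same magnitude, which is the condition the caption associates with accurate descent under the spectral norm. Running fewer iterations (`num_iters`) leaves the singular values less uniform, which is the knob the right panel varies for Muon.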