Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

TL;DR

The paper investigates how to scale matrix-preconditioned optimizers (Shampoo, Muon, SOAP) with width and depth by deriving hyperparameter transfer rules under the Maximal Update Parameterization (μP). It shows that μP improves learning-rate transfer across widths, but finite-width effects can still cause the optimal learning rate to drift unless mitigated by blocking and explicit spectral normalization; depth scaling is addressed via a 1/L residual branch multiplier. For compute-optimal training, the authors find that μP combined with independent weight decay scaled as 1/D (D being the width) yields near-optimal transfer, enabling Muon and Shampoo to achieve consistent speedups (about 1.4x and 1.3x, respectively) over AdamW on Llama-architecture language models from 190M to 1.4B parameters. Overall, the work argues that studying hyperparameter transfer is essential for reliably comparing optimizers at scale under a realistic tuning budget.

Abstract

Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as $\mu$P. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from $190$M to $1.4$B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.
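
As a rough illustration of the $1/\mathrm{width}$ rule for independent weight decay, the sketch below rescales a weight decay tuned on a small base model to a wider one. The function name, the base width of 128, and the base weight decay of 0.1 are hypothetical placeholders chosen for the example, not values reported in the paper.

```python
def scaled_weight_decay(base_weight_decay: float, base_width: int, width: int) -> float:
    """Rescale a weight decay tuned at `base_width` so that it shrinks as 1/width.

    Illustrative helper (an assumption, not the authors' code): doubling the
    width halves the independent weight decay.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_weight_decay * base_width / width


if __name__ == "__main__":
    # Hypothetical base model: width 128, tuned independent weight decay 0.1.
    for width in (128, 256, 512, 1024):
        print(width, scaled_weight_decay(0.1, 128, width))
```

Under this rule only the base model needs a weight decay sweep; wider models reuse the tuned value after rescaling.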

Paper Structure

This paper contains 88 sections, 132 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Hyperparameter transfer is crucial for achieving good performance with matrix-preconditioned optimizers across scales. (Left) We derive $\mu$P scaling for stabilizing the optimal learning rate across widths. Fixing the base learning rate, $\mu$P yields consistent training dynamics across widths for transformers on OpenWebText, with wider models always performing better, whereas the standard parameterization (SP) leads to training instability and unstable optima (\ref{sec:transfer}). Muon-Adam uses Muon in hidden layers and Adam in embedding and readout layers. Adam#Shampoo is Shampoo with Adam-grafting. Shampoo and SOAP use a block size of 128. (Right) Combining $\mu$P with $1/\mathrm{width}$ independent weight decay, Muon and Shampoo achieve consistent $1.4\times$ and $1.3\times$ speedups over well-tuned AdamW in training 190M to 1.4B-parameter models on FineWeb. By contrast, SP rapidly deteriorates the performance of both optimizers as they scale (\ref{sec:scaling}).
  • Figure 2: $\mu$P leads to better but imperfect learning rate transfer for matrix-preconditioned optimizers. (Top) The optimal learning rate is more consistent across widths $D$ under $\mu$P for transformers trained on OpenWebText. We show the learning rate as the multiplier $\eta_\mathrm{base}/\eta_0$, where $\eta_0$ is the optimal learning rate for the base model for each optimizer. (Bottom) $\mu$P achieves lower loss when zero-shot transferring the optimal learning rate found in the base model ($D=128$) to larger models (up to $D = 4096$) and passes the "coordinate check": the RMS of the one-step feature update in the last layer is invariant to width in early training (step 10), except for SOAP. We explain the imperfect transfer of $\mu$P and why it fails for SOAP in \ref{sec:mup_empirical}. # stands for Adam-grafting, Muon-Adam uses Adam for the embedding and readout, and Shampoo$^2$ uses $e_L=e_R=1/2$.
  • Figure 3: Blocking and explicit normalization reduce finite-width deviations and improve transfer. (Left) With a fixed block size of 128, $\mu$P consistently achieves good learning rate transfer, including for Shampoo with grafting and SOAP, which otherwise have unstable optimal learning rates (\ref{fig:width-scaling}). (Right) Explicit spectral normalization (Norm) improves learning rate transfer where $\mu$P alone fails (no blocking used here).
  • Figure 4: Depthwise learning rate transfer is effective for all tested optimizers. We apply a $1/L$ residual branch multiplier ($\alpha =1$) and adjust the learning rate to ensure $\Theta(1)$ feature learning in each layer, following \cite{dey2025don}. This outperforms SP ($\alpha = 0$) and stabilizes the size of the early-time (step 10) feature update $\Delta h$ when transferring from 3-layer to 192-layer transformers on OpenWebText. We provide experiment details in \ref{app:depth-experiments}.
  • Figure 5: Learning rate and weight decay transfer on FineWeb under compute-optimal training. (Left) $\mu$P approximately stabilizes the optimal learning rate, while spectral normalization reduces learning rate sensitivity and achieves slightly better performance. Best indicates taking the minimum over all learning rates and parameterizations. (Right) Optimal independent weight decay scales like $1/D$. Muon uses Adam in the embedding layer. Shampoo and SOAP use a block size of 512. For Shampoo, we use Adam-grafting and Adam in the embedding and readout, which we found to perform better than applying one-sided Shampoo.
  • ...and 8 more figures
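
The depth rule in the Figure 4 caption can be pictured with a toy residual stack. The sketch below is only a minimal illustration under stated assumptions: it takes the residual branch multiplier to be $L^{-\alpha}$, which matches the two cases named in the caption ($\alpha=1$ gives $1/L$, $\alpha=0$ gives the unscaled SP branch), and it uses a made-up constant branch output instead of a real transformer block. All function names are hypothetical.

```python
import math


def residual_multiplier(num_layers: int, alpha: float = 1.0) -> float:
    """Branch multiplier L**(-alpha): alpha=1 gives 1/L, alpha=0 gives 1 (SP)."""
    return num_layers ** (-alpha)


def rms(values):
    """Root-mean-square of a list of floats, as used in a coordinate check."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def run_residual_stack(x, branch, num_layers: int, alpha: float):
    """Apply h <- h + multiplier * branch(h) for num_layers layers."""
    multiplier = residual_multiplier(num_layers, alpha)
    h = list(x)
    for _ in range(num_layers):
        update = branch(h)
        h = [h_i + multiplier * u_i for h_i, u_i in zip(h, update)]
    return h


if __name__ == "__main__":
    # Toy branch (assumption): every layer contributes a unit-sized output.
    unit_branch = lambda h: [1.0] * len(h)
    x = [0.0] * 4
    for depth in (3, 12, 48, 192):
        scaled = rms(run_residual_stack(x, unit_branch, depth, alpha=1.0))
        unscaled = rms(run_residual_stack(x, unit_branch, depth, alpha=0.0))
        print(depth, scaled, unscaled)
```

In this toy setting the accumulated residual contribution stays near 1 at every depth when $\alpha=1$ and grows linearly with depth when $\alpha=0$, which is the intuition behind using a depth-dependent multiplier; the learning rate adjustment mentioned in the caption is not modeled here.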