An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants

Michael Crawshaw, Chirag Modi, Mingrui Liu, Robert M. Gower

TL;DR

This work unifies non-Euclidean gradient methods for neural networks by formulating a general steepest-descent framework operating on the full parameter space with per-layer norms and a product-norm aggregation. It introduces MuonMax, a robust variant derived from a new product norm, and shows how model truncation (Momo) can be integrated with arbitrary norms to further boost stability and hyperparameter-tuning robustness. By recasting Muon, Scion, and related methods within this framework, the authors derive closed-form update rules and demonstrate that MuonMax-Momo consistently matches or outperforms prior baselines while exhibiting far greater resilience to learning-rate choices, across 1B-token FineWeb and 6B-token SlimPajama tasks. In practice, combining MuonMax with Momo reduces the manual learning-rate tuning needed to train large language models on new tasks.

Abstract

To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and the recently proposed Muon as a type of non-Euclidean gradient descent, and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model-based momentum (known as Momo). The new Momo variants of Muon are significantly more robust to hyperparameter tuning, and often achieve a better validation score. Thus, for new tasks where the optimal hyperparameters are not known, we advocate using Momo in combination with MuonMax to save on costly hyperparameter tuning.
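The three design choices named in the abstract (a per-layer norm, an aggregation across layers, and whether to normalize) can be illustrated with a small sketch. This is not the paper's MuonMax or Muon update: it assumes a max-over-layers aggregation, uses the $\ell_\infty$ and $\ell_2$ norms as stand-in per-layer norms so that it needs no linear algebra library, and all function names are hypothetical. Under those assumptions, the steepest-descent direction for each layer is the unit-ball maximizer of the inner product with that layer's gradient, and the unnormalized step is scaled by the sum of the per-layer dual norms.

```python
import math

# Each layer rule is a pair (lmo, dual_norm); names are hypothetical.
# lmo(g) returns the unit-ball direction d maximizing <g, d> for that norm.

def lmo_linf(g):
    """l-infinity unit ball: the maximizer is sign(g)."""
    return [float((x > 0) - (x < 0)) for x in g]

def dual_of_linf(g):
    """The dual of the l-infinity norm is the l1 norm."""
    return sum(abs(x) for x in g)

def lmo_l2(g):
    """l2 unit ball: the maximizer is g / ||g||_2."""
    n = math.sqrt(sum(x * x for x in g))
    return [x / n for x in g] if n > 0 else [0.0] * len(g)

def dual_of_l2(g):
    """The l2 norm is its own dual."""
    return math.sqrt(sum(x * x for x in g))

def steepest_descent_step(params, grads, layer_rules, lr, normalize):
    """One step under the product norm max_l ||w_l||_(l).

    The dual of a max-over-layers norm is the sum of per-layer dual norms,
    so the unnormalized step multiplies every layer's direction by that sum.
    """
    if normalize:
        scale = 1.0
    else:
        scale = sum(dual(g) for (_, dual), g in zip(layer_rules, grads))
    new_params = []
    for w, g, (lmo, _) in zip(params, grads, layer_rules):
        d = lmo(g)
        new_params.append([wi - lr * scale * di for wi, di in zip(w, d)])
    return new_params

# Toy example: two "layers" stored as flat lists.
params = [[1.0, -2.0], [0.5, 0.5]]
grads = [[3.0, -4.0], [0.0, 2.0]]
rules = [(lmo_linf, dual_of_linf), (lmo_l2, dual_of_l2)]

normalized = steepest_descent_step(params, grads, rules, lr=0.1, normalize=True)
unnormalized = steepest_descent_step(params, grads, rules, lr=0.1, normalize=False)
```

Swapping the per-layer rule changes the optimizer family: a spectral-norm rule for matrix layers, which is the choice usually associated with Muon, would replace `lmo_linf` with an orthogonalization of the gradient matrix and `dual_of_linf` with the nuclear norm. That case is omitted here because it needs an SVD or a similar routine.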

Paper Structure

This paper contains 33 sections, 21 theorems, 100 equations, 6 figures, 4 tables, 3 algorithms.

Key Result

proposition 0

[Constrained Steepest Descent] The CSD update is given by

Figures (6)

  • Figure 1: Learning rate sweep for training GPT2-Large (774M params) on SlimPajama with 1B tokens. Left: Final validation loss for various learning rates. $\mathrm{MuonAdam}$ and $\mathrm{Scion}$ require precise tuning, whereas our $\mathrm{MuonAdam}$-$\mathrm{Momo}$ and $\mathrm{MuonMax}$-$\mathrm{Momo}$ achieve low loss for a significantly wider range of learning rates. Right: Training loss (with tuned LRs) for the last 40% of steps.
  • Figure 2: Final validation loss with varying learning rates on FineWeb1B (left) and SlimPajama6B (right). Our $\mathrm{MuonAdam}$-$\mathrm{Momo}$ and $\mathrm{MuonMax}$-$\mathrm{Momo}$ have wider basins than $\mathrm{MuonAdam}$ and $\mathrm{Scion}$, indicating increased robustness to learning rate tuning.
  • Figure 3: Sensitivity to loss lower bound $F_*$ for model truncation (FineWeb1B).
  • Figure 4: Training loss for the last 40% of training for FineWeb1B (left) and SlimPajama6B (right).
  • Figure 5: Effect of model truncation on final validation loss. Note that for these runs, we did not use stale nuclear norm approximations in order to isolate the effect of model truncation.
  • ...and 1 more figure
