AdaMuon: Adaptive Muon Optimizer

Chongjie Si, Debing Zhang, Wei Shen

TL;DR

AdaMuon targets efficient, stable large-scale neural network training by combining Muon's geometry-preserving orthogonal updates with coordinate-wise variance adaptivity. It introduces an element-wise second-moment estimator on orthogonalized updates, a sign-based stabilization step applied before the polar decomposition, and an RMS-alignment scheme that keeps the optimizer compatible with Adam learning-rate schedules. Empirical results on GPT-2 and Qwen2.5 show AdaMuon delivering substantial training-efficiency gains (more than 40% over Adam in large-scale settings) and strong benchmark performance across 15 tasks, indicating robustness across model scales. The work offers a practical, scalable optimizer that retains Muon's stability while enabling per-coordinate adaptation in large foundation-model training.

Abstract

We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second-moment estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, in which the momentum is sign-transformed before orthogonalization. Together, these two components enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy that matches the root-mean-square update magnitude to that of Adam, allowing direct reuse of existing learning-rate schedules without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can also surpass Adam by more than 40% in training efficiency in large-scale scenarios.
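The pipeline described in the abstract can be sketched roughly as below. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it uses an exact SVD-based polar factor as the orthogonalization step, and the function name `adamuon_step_sketch`, the momentum form, the defaults for `beta` and `eps`, and the `target_rms=0.2` constant are all hypothetical choices made for the example.

```python
import numpy as np

def polar_factor(a):
    """Orthogonal polar factor U @ V^T of a matrix, computed via a thin SVD."""
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u @ vt

def adamuon_step_sketch(w, grad, m, v, lr=1e-3, beta=0.95, eps=1e-8, target_rms=0.2):
    """One illustrative update for a 2-D weight matrix `w`.

    All hyperparameter defaults are assumptions, not values taken from the paper.
    """
    # Momentum accumulation (assumed form).
    m = beta * m + grad
    # (2) Sign-stabilized orthogonal update: sign-transform, then orthogonalize.
    o = polar_factor(np.sign(m))
    # (1) Element-wise second-moment estimate of the orthogonalized update.
    v = beta * v + (1.0 - beta) * o * o
    o_hat = o / (np.sqrt(v) + eps)
    # RMS alignment: rescale so the update RMS equals `target_rms` (assumed constant).
    scale = target_rms * np.sqrt(o_hat.size) / (np.linalg.norm(o_hat) + eps)
    w = w - lr * scale * o_hat
    return w, m, v

# Usage on a small random matrix.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((4, 6))
g0 = rng.standard_normal((4, 6))
w1, m1, v1 = adamuon_step_sketch(w0, g0, np.zeros_like(w0), np.zeros_like(w0), lr=0.1)
update_rms = np.sqrt(np.mean(((w0 - w1) / 0.1) ** 2))
print("update RMS:", update_rms)
```

Because of the final rescaling, the RMS of the applied update is pinned to `target_rms` regardless of the matrix shape, which is the property that would let an Adam-tuned learning-rate schedule carry over.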

Paper Structure

This paper contains 40 sections, 7 theorems, 39 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $f:\mathbb{R}\to\mathbb{R}$ be a function applied element-wise to $\mathbf{M}_t$ before the polar operator $g(\cdot)=\mathrm{polar}(\cdot)$. Suppose $f$ satisfies the following conditions: [...]. Then $f$ must be of the form $f(x)=c\cdot \mathrm{sign}(x)$ with $c>0$. Moreover, since $g$ is globally scale-invariant, the multiplicative constant $c$ is immaterial, and the unique canonical choice is $f(x)=\mathrm{sign}(x)$.
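The scale-invariance remark in Theorem 1 can be checked numerically. The sketch below assumes the polar operator is the orthogonal factor $UV^\top$ of a thin SVD. The matrix `m` is a hypothetical momentum matrix chosen so that its sign pattern has full rank, and the constant `3.7` is an arbitrary positive value of $c$.

```python
import numpy as np

def polar_factor(a):
    """Orthogonal polar factor U @ V^T of a matrix, computed via a thin SVD."""
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u @ vt

# Hypothetical momentum matrix; sign(m) has determinant -4, so it is full rank
# and its polar factor is unique.
m = np.array([[0.5, -2.0, 1.5],
              [-1.0, 0.3, 2.0],
              [4.0, 1.0, -0.2]])
s = np.sign(m)

q_base = polar_factor(s)          # polar(sign(M))
q_scaled = polar_factor(3.7 * s)  # polar(c * sign(M)) with c = 3.7

# The positive constant c does not change the polar factor.
scale_invariant = np.allclose(q_base, q_scaled)
# The polar factor of a full-rank square matrix is orthogonal.
is_orthogonal = np.allclose(q_base.T @ q_base, np.eye(3))
print(scale_invariant, is_orthogonal)
```

Multiplying by $c>0$ only rescales the singular values, which the polar factor discards, so fixing $c=1$ loses nothing.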

Figures (6)

  • Figure 1: Training and validation loss comparisons of AdamW, Muon, and AdaMuon.
  • Figure 2: Results of AdamW, Muon, and AdaMuon when training Qwen2.5-1.5B and 7B dense models.
  • Figure 3: Ablation Study of AdaMuon. We present the training loss curve of GPT-2 Small and Qwen2.5-1.5B models.
  • Figure 4: Training behavior of AdamW, Muon, and AdaMuon.
  • Figure 5: Results of AdamW and AdaMuon when training DeepSeek V3 models.
  • ...and 1 more figure

Theorems & Definitions (11)

  • Theorem 1: Characterization of admissible element-wise transformations
  • Lemma 1: Preconditioner bounds and consequences
  • proof
  • Lemma 2: RMS alignment fixes the step norm
  • proof
  • Proposition 1: One-step inequality
  • proof
  • Theorem 2: Diminishing steps $\eta_t=\eta_0/\sqrt{t}$
  • Theorem 3: Constant steps $\eta_t\equiv\eta$
  • Lemma 3: limit superior bound
  • ...and 1 more