AdaMuon: Adaptive Muon Optimizer
Chongjie Si, Debing Zhang, Wei Shen
TL;DR
AdaMuon targets efficient, stable large-scale neural network training by combining Muon's geometry-preserving orthogonal updates with coordinate-wise variance adaptivity. It introduces an element-wise second-moment estimator on orthogonal updates, a sign-based stabilization step applied to the momentum before orthogonalization, and an RMS-alignment scheme that keeps update magnitudes compatible with Adam learning-rate schedules. Empirical results on GPT-2 and Qwen2.5 show AdaMuon surpassing Adam by more than 40% in training efficiency and performing strongly across 15 benchmark tasks, with consistent behavior across model scales. The work offers a practical, scalable optimizer that retains Muon's stability while enabling per-coordinate adaptation in large foundation-model training.
Abstract
We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second-moment estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, in which the momentum is sign-transformed before orthogonalization. Together, these two components enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy that matches the root-mean-square update magnitude to that of Adam, allowing existing learning-rate schedules to be reused directly without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can surpass Adam by more than 40% in training efficiency in large-scale scenarios.
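The three mechanisms described in the abstract can be illustrated with a minimal NumPy sketch of a single update step for one weight matrix. This is an illustration under stated assumptions, not the authors' implementation: the function names (`polar_factor`, `init_state`, `adamuon_step`), the momentum form, the shared decay `beta`, and the target RMS of 0.2 are all hypothetical choices, and orthogonalization is done here with an exact SVD-based polar factor for clarity rather than an iterative approximation.

```python
import numpy as np


def polar_factor(m):
    """Return the orthogonal polar factor U @ Vt of m.

    Computed exactly via SVD for clarity; this is a stand-in for whatever
    orthogonalization routine a practical implementation would use.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def init_state(w):
    """Hypothetical optimizer state: first-moment and second-moment buffers."""
    return {"m": np.zeros_like(w), "v": np.zeros_like(w)}


def adamuon_step(w, grad, state, lr=1e-3, beta=0.95, eps=1e-8, target_rms=0.2):
    """One illustrative AdaMuon-style step for a 2-D weight matrix.

    Assumptions (not taken from the source): the momentum form, a shared
    decay `beta` for both buffers, and `target_rms` as the Adam-like
    update magnitude.
    """
    # Momentum accumulation (assumed heavy-ball form).
    state["m"] = beta * state["m"] + grad

    # Sign-stabilized orthogonal update: sign-transform, then orthogonalize.
    o = polar_factor(np.sign(state["m"]))

    # Element-wise second-moment estimate of the orthogonalized update.
    state["v"] = beta * state["v"] + (1.0 - beta) * o * o
    o_hat = o / (np.sqrt(state["v"]) + eps)

    # RMS alignment: rescale so the update RMS matches the target.
    rms = np.sqrt(np.mean(o_hat ** 2))
    update = target_rms * o_hat / (rms + eps)

    return w - lr * update


# Usage: one step on a random 4 x 6 weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
state = init_state(w)
w = adamuon_step(w, grad, state, lr=1e-3)
```

Because the final update is rescaled to a fixed RMS, its overall magnitude is independent of the gradient scale; in this sketch the second-moment estimate changes only the relative weighting of coordinates within the matrix.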
