A Note on the Convergence of Muon
Jiaxiang Li, Mingyi Hong
TL;DR
The note analyzes the convergence of the Muon optimizer, a singular-value-decomposition-based, momentum-like stochastic optimizer for matrix-valued objectives. It develops a Frobenius-norm descent lemma and explicit convergence rates under standard smoothness and noise assumptions, detailing how step size, momentum, and batch size shape performance. It further examines a related heavy-ball/minibatch scheme with a spectral-norm objective, providing a bound that depends on the initial optimality gap and the noise variance, and discusses fundamental limitations on batch-free convergence. Together, the results clarify the parameter-tuning trade-offs for Muon during large-model pretraining and contribute to the theoretical understanding of SVD-based optimization updates in stochastic settings.
Abstract
In this note, we examine the convergence of the Muon optimizer, a new optimizer for pretraining LLMs. Muon is closely related to a specialized steepest descent method in which the update direction is the minimizer of the quadratic approximation of the objective function under the spectral norm. We provide a convergence analysis for both versions of the optimizer and discuss its implications.
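To make the steepest-descent connection concrete, the following derivation sketches the spectral-norm minimizer. The symbols $G$ (a gradient or momentum matrix), $L$ (a smoothness constant), and $\eta$ (a step size) are notation assumed here for illustration and need not match the notation used later in the note.

```latex
% Assumed notation: G = U \Sigma V^\top is a compact SVD, \|\cdot\|_2 is the
% spectral norm, \|\cdot\|_* is the nuclear norm, and
% \langle A, B \rangle = \operatorname{tr}(A^\top B).
\begin{align*}
D^\star
  &= \operatorname*{arg\,min}_{D}\;
     \langle G, D \rangle + \frac{L}{2}\,\|D\|_2^2 . \\
% Step 1: fix the size t = \|D\|_2. By duality of the spectral and nuclear norms,
\min_{\|D\|_2 = t}\, \langle G, D \rangle
  &= -\,t\,\|G\|_* ,
  \qquad \text{attained at } D = -\,t\,U V^\top . \\
% Step 2: minimize the resulting scalar quadratic over t \ge 0.
\min_{t \ge 0}\; \Bigl( -\,t\,\|G\|_* + \frac{L}{2}\,t^2 \Bigr)
  &\;\Longrightarrow\; t^\star = \frac{\|G\|_*}{L} . \\
% Step 3: combine the two steps.
D^\star
  &= -\,\frac{\|G\|_*}{L}\; U V^\top .
\end{align*}
```

Under this reading, a Muon-style step of the form $W \leftarrow W - \eta\, U V^\top$ keeps the direction $U V^\top$ and absorbs the scalar $\|G\|_*/L$ into the step size. This is a sketch of the relationship under the stated assumptions, not a statement of the exact algorithm analyzed in the note.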
