Insights on Muon from Simple Quadratics
Antoine Gonon, Andreea-Alexandra Muşat, Nicolas Boumal
TL;DR
This work analyzes Muon, an optimizer that uses an approximate polar factor to orthogonalize momentum before taking a step, on simple strongly convex quadratics. It shows that exact polar projection with fixed step sizes causes grid confinement, which prevents convergence to arbitrarily small loss and yields no better than $O(1/\sqrt{\varepsilon})$ end-to-end dependence on the target accuracy $\varepsilon$, unlike the $O(\log(1/\varepsilon))$ of gradient descent (GD) in the same setting. The authors further demonstrate that inexact projection can qualitatively alter the dynamics: moderate stochastic perturbations can break confinement and even accelerate hitting times, and the advantage of approximate projection depends sensitively on the shape of the spectrum, not just on conditioning. Finally, they show that per-step improvements on quadratics do not reliably predict end-to-end speed, underscoring the need for theories that jointly model projection error profiles and detailed spectral structure to explain Muon’s behavior in practice.
Abstract
Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance “to be argued away”. We show that already on simple strongly convex functions such as $L(W)=\frac12\|W\|_{\text{F}}^2$, these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time performance, an effect that practitioners exploit to tune Muon but that existing theory largely treats as a pure accuracy compromise; and (ii) structural properties of the objective affect finite-budget constants beyond the prevailing conditioning-based explanations. Thus, any general theory covering these cases must either incorporate these ingredients explicitly or explain why they are irrelevant in the regimes of interest.
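To make the role of the exact polar step concrete on $L(W)=\frac12\|W\|_{\text{F}}^2$, here is a minimal sketch under simplifying assumptions that are not taken from the paper: no momentum, an exact polar factor, and a fixed step size $\eta$. The gradient is $W$ itself, so with $W = U\Sigma V^\top$ the update $W - \eta\,UV^\top = U(\Sigma - \eta I)V^\top$ maps each singular value $\sigma$ to $|\sigma - \eta|$, whereas gradient descent maps it to $(1-\eta)\sigma$. The code therefore tracks a single singular value rather than a full matrix. The function names `muon_exact_sigma` and `gd_sigma` are hypothetical, and the starting value is chosen so that it is not a multiple of $\eta$ (which avoids the ambiguous polar factor at $\sigma = 0$).

```python
from fractions import Fraction


def muon_exact_sigma(sigma0, eta, steps):
    """Singular-value trajectory of W <- W - eta * polar(W) on 0.5 * ||W||_F^2.

    Assumes no momentum and an exact polar factor, so sigma -> |sigma - eta|.
    """
    traj = [sigma0]
    for _ in range(steps):
        traj.append(abs(traj[-1] - eta))
    return traj


def gd_sigma(sigma0, eta, steps):
    """Singular-value trajectory of gradient descent: sigma -> (1 - eta) * sigma."""
    traj = [sigma0]
    for _ in range(steps):
        traj.append((1 - eta) * traj[-1])
    return traj


# Exact rational arithmetic keeps the grid structure free of rounding noise.
sigma0 = Fraction(103, 100)  # illustrative start, not a multiple of eta
eta = Fraction(1, 10)        # illustrative fixed step size

muon = muon_exact_sigma(sigma0, eta, 50)
gd = gd_sigma(sigma0, eta, 50)

muon_floor = min(muon)  # smallest singular value the polar step ever reaches
gd_final = gd[-1]       # gradient descent keeps shrinking geometrically

print("polar-step floor:", float(muon_floor))
print("GD after 50 steps:", float(gd_final))
```

In this toy run the polar-step iterate decreases by $\eta$ per step until it drops below $\eta$, then alternates between $3/100$ and $7/100$ indefinitely: every iterate stays on the grid $\{\pm\sigma_0 + k\eta\}$, so the loss cannot go below $\frac12 (3/100)^2$ per singular value, while the gradient descent iterate continues to contract. This only illustrates the confinement phenomenon in the simplest setting and does not model momentum or inexact projections.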
