Insights on Muon from Simple Quadratics

Antoine Gonon, Andreea-Alexandra Muşat, Nicolas Boumal

TL;DR

This work analyzes Muon, an optimizer that uses an approximate polar factor to orthogonalize momentum before taking a step, on simple strongly convex quadratics. It shows that exact polar projection with fixed step sizes causes grid confinement, preventing convergence to arbitrarily small loss and yielding no better than $O(1/\sqrt{\varepsilon})$ end-to-end dependence, unlike GD’s $O(\log(1/\varepsilon))$ in the same setting. The authors further demonstrate that inexact projection can qualitatively alter dynamics: moderate stochastic perturbations can break confinement and even accelerate hitting times, and the advantage of approximate projection depends sensitively on the spectrum shape, not just conditioning. Finally, they show that per-step improvements on quadratics do not reliably predict end-to-end speed, underscoring the need for theories that jointly model projection error profiles and detailed spectral structure to explain Muon’s behavior in practice.

Abstract

Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance "to be argued away". We show that already on simple strongly convex functions such as $L(W)=\frac12\|W\|_{\text{F}}^2$, these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time performance, an effect that practitioners exploit to tune Muon but that existing theory largely treats as a pure accuracy compromise; and (ii) structural properties of the objective affect finite-budget constants beyond the prevailing conditioning-based explanations. Thus, any general theory covering these cases must either incorporate these ingredients explicitly or explain why they are irrelevant in the regimes of interest.
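To make the grid-confinement claim concrete, here is a minimal sketch on $L(W)=\frac12\|W\|_{\text{F}}^2$, whose gradient is $W$ itself. It is an illustration under simplifying assumptions, not the authors' experimental code: the polar factor is computed exactly via an SVD, momentum is omitted, and the helper names (`polar_factor`, `run_polar`, `run_gd`, `grid_floor`) are hypothetical. With $W = U\,\mathrm{diag}(s)\,V^\top$, an exact polar step of size $\alpha$ maps each singular value $s_i$ to $|s_i-\alpha|$, so every $s_i$ stays on the grid $\pm s_i(0) + \alpha\mathbb{Z}$ and the loss cannot drop below a floor set by the distance of the initial singular values to $\alpha\mathbb{Z}$. GD with a fixed step instead contracts the loss geometrically.

```python
import numpy as np

def polar_factor(M):
    # Exact polar factor U V^T from the thin SVD M = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def loss(W):
    # L(W) = 0.5 * ||W||_F^2, so grad L(W) = W.
    return 0.5 * np.linalg.norm(W, "fro") ** 2

def run_polar(W0, alpha, steps):
    # Fixed-step update along the exact polar factor of the gradient
    # (no momentum: a simplification made for this sketch).
    W, losses = W0.copy(), []
    for _ in range(steps):
        W = W - alpha * polar_factor(W)
        losses.append(loss(W))
    return np.array(losses)

def run_gd(W0, eta, steps):
    # Plain gradient descent: W <- (1 - eta) * W on this objective.
    W, losses = W0.copy(), []
    for _ in range(steps):
        W = W - eta * W
        losses.append(loss(W))
    return np.array(losses)

def grid_floor(W0, alpha):
    # Lower bound on the loss under exact polar steps: each singular
    # value keeps its initial distance d_i to the grid alpha * Z.
    s0 = np.linalg.svd(W0, compute_uv=False)
    r = np.mod(s0, alpha)
    d = np.minimum(r, alpha - r)
    return 0.5 * np.sum(d ** 2)

rng = np.random.default_rng(0)
n, alpha = 8, 0.1
W0 = rng.normal(size=(n, n)) / np.sqrt(n)

polar_losses = run_polar(W0, alpha, steps=200)
gd_losses = run_gd(W0, eta=0.1, steps=200)
floor = grid_floor(W0, alpha)

print("polar-step minimum loss:", polar_losses.min())
print("predicted grid floor:   ", floor)
print("GD final loss:          ", gd_losses[-1])
```

In this sketch the polar-step loss stalls at or above the grid floor, while the GD loss keeps shrinking. Shrinking $\alpha$ lowers the floor but lengthens the approach phase, which is the trade-off behind the TL;DR's comparison of end-to-end dependence on $\varepsilon$.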

Paper Structure (46 sections, 2 theorems, 73 equations, 16 figures, 8 tables)

Key Result

Proposition E.1

Consider the one-dimensional noisy sign dynamics with step size $\alpha>0$ and noise level $\sigma\ge 0$: where $(\xi_n)_{n\ge1}$ are i.i.d. $\mathcal{N}(0,1)$ random variables and $w_0^{(\sigma)}\in\mathbb{R}$ is fixed. Fix $\varepsilon>0$ and define the hitting time Assume $\sigma>0$. Then, for any initialization $w_0\in\mathbb{R}$, By contrast, for the deterministic dynamics ($\sigma = 0$),
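The contrast in this proposition can be explored numerically. The sketch below assumes one concrete form of the noisy sign update, $w_{n+1} = w_n - \alpha\,(\operatorname{sign}(w_n) + \sigma\,\xi_{n+1})$, and takes the hitting time to be the first $n$ with $|w_n|\le\varepsilon$. Both choices are assumptions made for illustration and may differ from the paper's exact definitions. The function name `hitting_time` and the parameter values are hypothetical, except for the cap of $1000$ steps, which mirrors $T_{\max}$ in the Figure 4 caption.

```python
import numpy as np

def hitting_time(w0, alpha, sigma, eps, t_max=1000, rng=None):
    """First n with |w_n| <= eps, or None if not reached within t_max steps."""
    rng = np.random.default_rng() if rng is None else rng
    w = float(w0)
    for n in range(t_max + 1):
        if abs(w) <= eps:
            return n
        # ASSUMED update: exact sign step plus additive Gaussian error.
        w = w - alpha * (np.sign(w) + sigma * rng.standard_normal())
    return None

alpha, eps, w0 = 0.1, 0.01, 1.03

# sigma = 0: the iterate moves on the grid w0 + alpha * Z and ends up
# cycling between 0.03 and -0.07, so it never enters [-eps, eps].
t_det = hitting_time(w0, alpha, sigma=0.0, eps=eps)

# sigma > 0: the perturbation moves the iterate off the grid.
t_noisy = hitting_time(w0, alpha, sigma=0.5, eps=eps,
                       rng=np.random.default_rng(0))

print("deterministic hitting time:", t_det)
print("noisy hitting time:        ", t_noisy)
```

Sweeping `sigma` over a range and recording the median of `hitting_time` across seeds is one way to look for the "sweet spot" described in the Figure 4 caption, under the same assumptions.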

Figures (16)

  • Figure 1: Eigenvalue distributions for the controlled-spectrum families. All families share the same endpoints $(s_{\min},s_{\max})=(10^{-3},10)$ and thus the same condition number $\kappa=s_{\max}/s_{\min} = 10^4$, but have different spectrum shapes.
  • Figure 2: Orders of magnitude of loss decrease after $T=500$ iterations on the controlled-spectrum quadratic family (\ref{sec:setup-conditioning}). For each spectrum shape (all sharing the same endpoints and condition number $\kappa=10^4$), bars are aligned at the common initial loss so that their lengths represent the logarithmic decrease achieved. The vertical axis is indexed by integers $-1,-2,\ldots$, corresponding to reductions by factors of $10^{-1},10^{-2},\ldots$. Muon achieves a comparable reduction across spectrum families, whereas GD varies widely. This indicates that the flip in ranking is driven primarily by GD's dependence on the spectral distribution, while Muon remains comparatively stable across the spectra we test. For the max_spiked family, GD's loss is numerically very close to $0$. For plot readability, we clip the corresponding bar at roughly $-8$ orders of magnitude.
  • Figure 3: Averaged loss curves with shaded bands covering $95\%$ of trajectories for the quadratic objective $L(W)=\tfrac{1}{2}\langle W, A W\rangle + \langle B, W\rangle + c$, averaged over $100$ random initializations $[W_0]_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1/n)$. Across panels, the matrix $A$ has the same condition number but different spectrum shapes (see Figure 1). On the min_spiked spectrum, the loss achieved by GD is numerically very close to zero. We clip it at $10^{-5}$ for plot readability.
  • Figure 4: Median $\varepsilon$-hitting time exhibits a similar "sweet spot" where moderate projection error $\sigma$ in \ref{eq:noisy-sign} accelerates convergence. $T_\varepsilon^{(\sigma)}$ is capped at $T_{\max}=1000$ steps (top dashed line). The noiseless cycling time $|s_0|/\alpha$ is shown for reference (bottom dashed line). The shaded band shows the central $95\%$ interval of $T_\varepsilon^{(\sigma)}$ across runs. Details in \ref{app:noisy-sign-xps}.
  • Figure 5: Objective gap $L(W_t)-L(W_\star)$ for the quadratic $L(W)=\tfrac{1}{2} \langle W, A W\rangle$ (all matrices are $n\times n$, $n=100$). Here $A=QS Q^\top$ with $Q$ random orthogonal and $S=\operatorname{diag}(1,\ldots,1,10^3)$, so $\kappa=10^3$. We compare GD (exact line search along the Euclidean gradient) to a greedy policy that, at each iteration, picks the direction (gradient or Stiefel/polar) yielding the larger exact line-search decrease. On this instance, the greedy policy selects the Stiefel/polar step at every iteration, yet GD converges much faster end-to-end. Curves show the median across $100$ random initializations $W_0$ with i.i.d. entries $(W_0)_{ij}\sim\mathcal{N}(0,1/n)$; shaded bands contain the central $95\%$ of runs.
  • ...and 11 more figures

Theorems & Definitions (4)

  • Proposition E.1: Noise breaks the deterministic grid trap
  • Proof: Proof idea for Proposition E.1
  • Proposition E.2: Small- and large-noise have long hitting times
  • Proof: Proof idea for Proposition E.2