Insights on Muon from Simple Quadratics

Antoine Gonon; Andreea-Alexandra Muşat; Nicolas Boumal

TL;DR

This work analyzes Muon, an optimizer that uses an approximate polar factor to orthogonalize momentum before taking a step, on simple strongly convex quadratics. It shows that exact polar projection with fixed step sizes causes grid confinement, preventing convergence to arbitrarily small loss and yielding no better than $O(1/\sqrt{\varepsilon})$ end-to-end dependence, unlike GD’s $O(\log(1/\varepsilon))$ in the same setting. The authors further demonstrate that inexact projection can qualitatively alter dynamics: moderate stochastic perturbations can break confinement and even accelerate hitting times, and the advantage of approximate projection depends sensitively on the spectrum shape, not just conditioning. Finally, they show that per-step improvements on quadratics do not reliably predict end-to-end speed, underscoring the need for theories that jointly model projection error profiles and detailed spectral structure to explain Muon’s behavior in practice.
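To make "approximate polar factor" concrete, here is a minimal sketch of a Newton-Schulz iteration in plain Python. It uses the classical cubic update $X \leftarrow \tfrac32 X - \tfrac12 X X^\top X$, which is swapped in for illustration; Muon implementations typically use a tuned higher-degree polynomial, and the helper names (`matmul`, `newton_schulz`, `orthogonality_error`) are hypothetical. Few iterations give a rough polar factor and more iterations give a near-exact one, which is the accuracy knob the summary refers to.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps):
    """Approximate the polar factor of G with the classical cubic iteration.

    Assumption: G is square and nonsingular. Dividing by the Frobenius norm
    puts all singular values in (0, 1], inside the convergence region.
    """
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(X, matmul(transpose(X), X))
        X = [[1.5 * X[i][j] - 0.5 * XXtX[i][j] for j in range(len(X[0]))]
             for i in range(len(X))]
    return X


def orthogonality_error(X):
    """Largest entry of |X^T X - I|; zero for an exact polar factor."""
    XtX = matmul(transpose(X), X)
    n = len(XtX)
    return max(abs(XtX[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


G = [[3.0, 1.0], [1.0, 2.0]]  # symmetric positive definite, so its polar factor is I
rough = newton_schulz(G, 2)   # inexact projection
fine = newton_schulz(G, 30)   # near-exact projection
```

Comparing `orthogonality_error(rough)` with `orthogonality_error(fine)` shows how the iteration count controls the projection error.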

Abstract

Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance "to be argued away". We show that already on simple strongly convex functions such as $L(W)=\frac12\|W\|_{\text{F}}^2$, these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time performance, an effect practitioners exploit to tune Muon but that existing theory largely treats as a pure accuracy compromise; and (ii) structural properties of the objective affect finite-budget constants beyond the prevailing conditioning-based explanations. Thus, any general theory covering these cases must either incorporate these ingredients explicitly or explain why they are irrelevant in the regimes of interest.
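The grid confinement mentioned in the TL;DR can be seen in one dimension. On $L(w)=\tfrac12 w^2$ the gradient is $w$ and its exact polar factor is $\operatorname{sign}(w)$, so a fixed-step update moves by exactly $\pm\alpha$ and every iterate stays on the grid $w_0 + \alpha\mathbb{Z}$. The sketch below is a scalar illustration without momentum, using exact rational arithmetic; the function names and parameter values are hypothetical.

```python
from fractions import Fraction


def sign(x):
    return (x > 0) - (x < 0)


def sign_descent(w0, alpha, steps):
    """Fixed-step descent along the exact 1D polar factor sign(w)."""
    w = Fraction(w0)
    trajectory = [w]
    for _ in range(steps):
        w = w - alpha * sign(w)
        trajectory.append(w)
    return trajectory


def gradient_descent(w0, eta, steps):
    """Plain GD on L(w) = w^2 / 2, whose gradient is w."""
    w = Fraction(w0)
    trajectory = [w]
    for _ in range(steps):
        w = w - eta * w
        trajectory.append(w)
    return trajectory


def loss(w):
    return w * w / 2


w0, alpha = Fraction("1.03"), Fraction("0.1")
sign_traj = sign_descent(w0, alpha, 50)
gd_traj = gradient_descent(w0, Fraction("0.5"), 50)

# Every sign-descent iterate is an integer number of steps away from w0.
on_grid = all(((w - w0) / alpha).denominator == 1 for w in sign_traj)
best_sign_loss = min(loss(w) for w in sign_traj)
best_gd_loss = min(loss(w) for w in gd_traj)
```

In this run the sign iterates end up alternating between two grid points on either side of zero, so the loss stops decreasing, while GD keeps contracting geometrically. Shrinking $\alpha$ lowers the floor but lengthens the approach to it, which is the trade-off behind the fixed-step $\varepsilon$-dependence discussed above.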

Paper Structure (46 sections, 2 theorems, 73 equations, 16 figures, 8 tables)

Code.
Introduction
Related Work
Exact Projection with Fixed Step Sizes: Grid Confinement, Rates, and What They (Don't) Explain
Global convergence to minimizer?
Best possible $\varepsilon$-dependence under fixed steps: no better than $O(1/\sqrt{\varepsilon})$
If not $\varepsilon$-dependence, then what? Conditioning and spectrum shape at a fixed budget
Approximate Projection Can Change the Dynamics (and Sometimes the Speed)
A minimal model: perturbed sign dynamics
Noise breaks confinement on the isotropic quadratic, and typical hitting times can be non-monotone
Quadratic evidence with actual Newton--Schulz projection: approximation can change who wins
One-Step Improvement on Quadratics is Not a Reliable Proxy for End-to-End Speed
Conclusion
Muon in the modded-nanoGPT speedrun setup (pseudocode)
Momentum variants for Muon-like methods
...and 31 more sections

Key Result

Proposition E.1

Consider the one-dimensional noisy sign dynamics with step size $\alpha>0$ and noise level $\sigma\ge 0$: [...] where $(\xi_n)_{n\ge1}$ are i.i.d. $\mathcal{N}(0,1)$ random variables and $w_0^{(\sigma)}\in\mathbb{R}$ is fixed. Fix $\varepsilon>0$ and define the hitting time $T_\varepsilon^{(\sigma)}$: [...] Assume $\sigma>0$. Then, for any initialization $w_0\in\mathbb{R}$, [...] By contrast, for the deterministic dynamics ($\sigma = 0$), [...]
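The displayed update rule and hitting-time definition are not reproduced in this summary, so the following simulation rests on two labeled assumptions: an additive-noise update $w_{n+1} = w_n - \alpha\,(\operatorname{sign}(w_n) + \sigma\,\xi_{n+1})$, and a hitting time equal to the first $n$ with $|w_n|\le\varepsilon$, capped at a budget `t_max`. Function names and parameter values are hypothetical. It is a sketch of the qualitative effect named in the proposition title, not a reproduction of the paper's experiments.

```python
import random
import statistics


def sign(x):
    return (x > 0) - (x < 0)


def hitting_time(w0, alpha, sigma, eps, t_max, rng):
    """First n with |w_n| <= eps under the ASSUMED noisy sign update.

    Returns t_max if the threshold is not reached within the budget.
    """
    w = w0
    for n in range(t_max):
        if abs(w) <= eps:
            return n
        # Assumed form: exact sign step plus Gaussian perturbation of size sigma.
        w -= alpha * (sign(w) + sigma * rng.gauss(0.0, 1.0))
    return t_max


def median_hitting_time(sigma, runs=200, w0=1.03, alpha=0.1, eps=0.01, t_max=1000):
    """Median capped hitting time over independent seeded runs."""
    times = [hitting_time(w0, alpha, sigma, eps, t_max, random.Random(seed))
             for seed in range(runs)]
    return statistics.median(times)


# sigma = 0: the iterate is trapped on the grid w0 + alpha * Z and, for this
# initialization, never enters [-eps, eps], so the cap is returned.
deterministic = hitting_time(1.03, 0.1, 0.0, 0.01, 1000, random.Random(0))

# sigma > 0: the perturbation moves the iterate off the grid.
noisy_median = median_hitting_time(0.3)
```

Sweeping `sigma` over a range of values with `median_hitting_time` is one way to look for the non-monotone behavior described in Figure 4.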

Figures (16)

Figure 1: Eigenvalue distributions for the controlled-spectrum families. All families share the same endpoints $(s_{\min},s_{\max})=(10^{-3},10)$ and thus the same condition number $\kappa=s_{\max}/s_{\min} = 10^4$, but have different spectrum shapes.
Figure 2: Orders of magnitude of loss decrease after $T=500$ iterations on the controlled-spectrum quadratic family (\ref{sec:setup-conditioning}). For each spectrum shape (all sharing the same endpoints and condition number $\kappa=10^4$), bars are aligned at the common initial loss so that their lengths represent the logarithmic decrease achieved. The vertical axis is indexed by integers $-1,-2,\ldots$, corresponding to reductions by factors of $10^{-1},10^{-2},\ldots$. Muon achieves a comparable reduction across spectrum families, whereas GD's reduction varies from one family to another. The flip in ranking is therefore driven primarily by GD's dependence on the spectral distribution, while Muon remains comparatively stable across the spectra we test. For the max_spiked family, GD's loss is numerically very close to $0$; for plot readability, we clip the corresponding bar at roughly $-8$ orders of magnitude.
Figure 3: Mean loss curves, with shaded bands covering $95\%$ of trajectories, for the quadratic objective $L(W)=\tfrac{1}{2}\langle W, A W\rangle + \langle B, W\rangle + c$, computed over $100$ random initializations $[W_0]_{ij}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1/n)$. Across panels, the matrix $A$ has the same condition number but different spectrum shapes (see Figure 1). On the min_spiked spectrum, the loss achieved by GD is numerically very close to zero; we clip it at $10^{-5}$ for plot readability.
Figure 4: Median $\varepsilon$-hitting time exhibits a similar "sweet spot" where moderate projection error $\sigma$ in the noisy sign dynamics (\ref{eq:noisy-sign}) accelerates convergence. $T_\varepsilon^{(\sigma)}$ is capped at $T_{\max}=1000$ steps (top dashed line). The noiseless cycling time $|{s}_0|/\alpha$ is shown for reference (bottom dashed line). The shaded band shows the central $95\%$ interval of $T_\varepsilon^{(\sigma)}$ across runs. Details in \ref{app:noisy-sign-xps}.
Figure 5: Objective gap $L(W_t)-L(W_\star)$ for the quadratic $L(W)=\tfrac{1}{2} \langle W, A W\rangle$ (all matrices are $n\times n$, $n=100$). Here $A=QS Q^\top$ with $Q$ random orthogonal and $S=\operatorname{diag}(1,\ldots,1,10^3)$, so $\kappa=10^3$. We compare GD (exact line search along the Euclidean gradient) to a greedy policy that, at each iteration, picks the direction (gradient or Stiefel/polar) yielding the larger exact line-search decrease. On this instance, the greedy policy selects the Stiefel/polar step at every iteration, yet GD converges much faster end-to-end. Curves show the median across $100$ random initializations $W_0$ with i.i.d. entries $(W_0)_{ij}\sim\mathcal{N}(0,1/n)$; shaded bands contain the central $95\%$ of runs.
...and 11 more figures

Theorems & Definitions (4)

Proposition E.1: Noise breaks the deterministic grid trap
Proof idea for Proposition E.1
Proposition E.2: Small- and large-noise have long hitting times
Proof idea for Proposition E.2
