On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang

TL;DR

This work analyzes the convergence of Muon, a matrix-parameter optimizer that orthogonalizes gradient updates, across nonconvex and star-convex regimes. It derives convergence guarantees under both Frobenius- and spectral-norm Lipschitz smoothness without requiring uniform smoothness and shows Muon can outperform gradient methods when Hessians are low-rank and approximately blockwise diagonal. Theoretical results are corroborated by experiments on neural-network-like settings and quadratic functions, illustrating conditions under which Muon has advantages. Overall, the paper highlights how exploiting matrix structure in optimization can yield improved convergence behavior and outlines directions for exploiting Hessian structure in future optimizer design.

Abstract

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We further characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices -- phenomena widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.

Paper Structure

This paper contains 20 sections, 13 theorems, 93 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

Proposition 3.6

If $\widehat{W}\in \mathbb{R}^{m\times n}$ is an $\epsilon$-nuclear norm stationary point of $f$, then it is also an $\epsilon$-Frobenius norm stationary point of $f$. If $\widehat{W}\in \mathbb{R}^{m\times n}$ is an $\epsilon$-Frobenius norm stationary point of $f$, then it is also an $\sqrt{r}\eps
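The link between the two stationarity notions rests on a standard norm inequality: for any $G\in\mathbb{R}^{m\times n}$ of rank at most $r$, $\|G\|_{\rm F}\le\|G\|_*\le\sqrt{r}\,\|G\|_{\rm F}$. The sketch below checks this numerically. It assumes that $\epsilon$-stationarity means the corresponding norm of $\nabla f(\widehat{W})$ is at most $\epsilon$, and it takes $r=\min\{m,n\}$; both are assumptions made for illustration, since the statement above is cut off in this copy. The helper names are hypothetical.

```python
import numpy as np

def nuclear_norm(G):
    """Sum of singular values of G."""
    return float(np.linalg.svd(G, compute_uv=False).sum())

def frobenius_norm(G):
    """Square root of the sum of squared entries of G."""
    return float(np.linalg.norm(G, "fro"))

def norm_sandwich(G):
    """Return (||G||_F, ||G||_*, sqrt(r) * ||G||_F) with r = min(m, n) (assumed)."""
    r = min(G.shape)
    fro = frobenius_norm(G)
    return fro, nuclear_norm(G), float(np.sqrt(r)) * fro

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((6, 4))  # stand-in for a gradient matrix
    lower, nuc, upper = norm_sandwich(G)
    # Frobenius <= nuclear <= sqrt(r) * Frobenius, up to rounding error
    assert lower <= nuc + 1e-12
    assert nuc <= upper + 1e-12
    print(f"||G||_F={lower:.4f}  ||G||_*={nuc:.4f}  sqrt(r)||G||_F={upper:.4f}")
```

Both bounds are attained: a rank-one matrix has equal nuclear and Frobenius norms, and a matrix whose singular values are all equal attains the upper bound.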

Figures (4)

  • Figure 1: Comparison of (S)GD, Adam, Muon (using Newton–Schulz iterations with momentum), Muon_withoutM (using Newton–Schulz iterations without momentum), and Muon_SVD (\ref{alg: muon}). In the deterministic setting (a), the loss is defined and trained over a fixed subset of CIFAR-10 \cite{krizhevsky2009learning}. In the stochastic setting (b), training uses mini-batches randomly sampled from the complete CIFAR-10 training set, and the loss is evaluated on the entire CIFAR-10 training set.
  • Figure 2: Comparison of $J_t$, $L_t$, $\|\nabla f(W_t)\|_*^2$ and $\|\nabla f(W_t)\|_{\rm F}^2$ over the training process of GD and Muon (\ref{alg: muon_deterministic}). $f$ is defined as the cross-entropy loss of an MLP with three matrix parameters $W^1\in\mathbb{R}^{128\times784}, W^2\in\mathbb{R}^{64\times128}, W^3\in\mathbb{R}^{10\times64}$ over a fixed subset of MNIST. The figure shows the gradients and Hessians with respect to $W^2$. Detailed settings can be found in Appendix \ref{app: exp}.
  • Figure 3: Experiments on a quadratic function. Detailed settings can be found in Appendix \ref{app: exp}.
  • Figure 4: $\varphi^5(x)$ with $\varphi(x)=ax+bx^3+cx^5$ and $a=3.4445$, $b=-4.7750$, $c=2.0315$. Similar to Figure 4 in \cite{jordan2024muon}. The lines at 0.65 and 1.25 are for illustration only; tighter bounds can be chosen.
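
The five-fold composition $\varphi^5$ from the Figure 4 caption can be reproduced with a few lines of plain Python. This is a minimal scalar sketch using the coefficients quoted in the caption; the function names are hypothetical, and the scalar view is only an illustration of how such a polynomial acts on a single singular value, not the matrix iteration itself.

```python
def phi(x, a=3.4445, b=-4.7750, c=2.0315):
    """One application of phi(x) = a*x + b*x**3 + c*x**5."""
    return a * x + b * x**3 + c * x**5

def phi_iterated(x, steps=5):
    """Apply phi `steps` times; steps=5 gives phi^5 as in the caption."""
    for _ in range(steps):
        x = phi(x)
    return x

if __name__ == "__main__":
    # Sample starting values in (0, 1], standing in for normalized singular values.
    for x0 in (0.01, 0.1, 0.5, 1.0):
        print(f"x0 = {x0:<5} phi^5(x0) = {phi_iterated(x0):.4f}")
```

For these sample inputs the result lands between the illustrative lines 0.65 and 1.25 rather than exactly at 1, which is consistent with the caption describing those lines as loose bounds. Much smaller inputs (for example 0.001) grow by roughly a factor of $a$ per step and may not reach that band within five iterations.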

Theorems & Definitions (24)

  • Definition 3.4
  • Definition 3.5
  • Proposition 3.6
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Corollary 4.4
  • Lemma 4.6
  • Theorem 4.8
  • Theorem 4.11
  • ...and 14 more