
The Newton-Muon Optimizer

Zhehang Du, Weijie Su

Abstract

The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix $W$ using only three matrices: the gradient $G$, an output-space curvature matrix $H$, and the data matrix $Z$ that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) $W \leftarrow W - \eta\cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$, where $\eta$ is the learning rate and $\mathrm{msgn}(X)=UV^\top$ if $X=USV^\top$ is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, on a reproduction of the earliest publicly released Modded-NanoGPT speedrun configuration using Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6\% fewer iterations and reduces wall-clock training time by about 4\%.
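To make the closed-form update concrete, here is a minimal PyTorch sketch of one Newton-Muon step as stated above, with momentum and weight decay omitted. The function names are illustrative, and the ridge term stabilizing $ZZ^\top$ is an assumption motivated by the paper's ridge-scaling ablation (Figure 5), not the authors' exact implementation.

```python
import torch

def msgn(X: torch.Tensor) -> torch.Tensor:
    """Matrix sign: msgn(X) = U V^T for a compact SVD X = U S V^T."""
    U, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh

def newton_muon_step(W, G, Z, lr=1e-3, ridge=1e-4):
    """One Newton-Muon update W <- W - lr * msgn(G (Z Z^T)^{-1}).

    W : (m, n) weight matrix
    G : (m, n) gradient of the loss with respect to W
    Z : (n, N) data matrix stacking the layer inputs column-wise
    """
    n = Z.size(0)
    # Right preconditioner: the input second moment, ridge-regularized so the
    # inverse is well defined (the exact ridge scaling here is hypothetical).
    ZZt = Z @ Z.T + ridge * torch.eye(n, device=Z.device, dtype=Z.dtype)
    # G (Z Z^T)^{-1}, computed via a solve rather than an explicit inverse;
    # the transpose trick uses the symmetry of Z Z^T.
    precond_grad = torch.linalg.solve(ZZt, G.T).T
    return W - lr * msgn(precond_grad)
```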


Paper Structure

This paper contains 67 sections, 7 theorems, 107 equations, 10 figures, 5 tables, and 2 algorithms.

Key Result

Proposition 1

Denote by $\Sigma_{W} \coloneqq (W-W^\star)(W-W^\star)^\top \in \mathbb{R}^{m\times m}$ the displacement second-moment matrix and write $\Sigma_{W}^{1/2}$ for its unique positive semidefinite square root. Then, under Assumption \ref{assump:exact-newton}, the unique minimizer $Q^\star$ of eq:t …
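As a notational aside, the unique positive semidefinite square root $\Sigma_{W}^{1/2}$ invoked in Proposition 1 can be computed by a symmetric eigendecomposition. A minimal PyTorch sketch (the clipping of round-off negatives is an implementation assumption):

```python
import torch

def psd_sqrt(Sigma: torch.Tensor) -> torch.Tensor:
    """Unique PSD square root: Sigma = V diag(lam) V^T -> V diag(sqrt(lam)) V^T."""
    lam, V = torch.linalg.eigh(Sigma)   # symmetric eigendecomposition
    lam = lam.clamp(min=0.0)            # guard against tiny negative round-off
    return (V * lam.sqrt()) @ V.T       # scales column j of V by sqrt(lam_j)

# Example with the displacement second moment from Proposition 1:
# Sigma_W = (W - W_star) @ (W - W_star).T
# root = psd_sqrt(Sigma_W)
```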

Figures (10)

  • Figure 1: Top: short track Record #4 validation loss comparison on the Modded-NanoGPT speedrun benchmark. Record #4 is the earliest publicly released configuration using Muon, and our reproduction on a single H100 GPU is denoted Muon. Newton--Muon adds the activation right-preconditioner before the Newton--Schulz iterations (sketched after this list). Newton--Muon reaches the Muon baseline final validation loss in $6\%$ fewer steps; despite a $1.8\%$ higher per-step cost from right-preconditioning, it reduces wall-clock time to that loss by about $4\%$. Bottom: CIFAR-10 experiments (Appendix \ref{app:cifar10-details}) on a 32-layer residual MLP show that Newton--Muon outperforms both Muon and AdamW in both per-step efficiency and overall wall-clock time.
  • Figure 2: Left: standard Muon is orthogonally equivariant. Right: Newton--Muon takes the pair $(G,Z)$ as input, and the diagram commutes if the right rotation of $G$ is accompanied by the transformation $Z\mapsto O_n^\top Z$.
  • Figure 3: Numerical study with spiked activation second moment $\mathrm{diag}(\kappa,1,\ldots,1)$ and $\kappa=64$. Top: baseline case $(N,p)=(8192,0.3)$. Middle: more uniform curvature $(N,p)=(8192,2.4)$. Bottom: smaller sample size $(N,p)=(1024,0.3)$. The left column shows the spectrum of ${H}$, and the right column shows the corresponding mean absolute scores $s({Q})$.
  • Figure 4: Refresh ablation for Newton--Muon on short track Record #4. Grouped bar plot over refresh interval $k$, with one bar per EWMA coefficient $\beta$ (ridge scaling fixed at $\gamma=0.2$ and learning rate fixed at $0.0040$).
  • Figure 5: Left: Ridge-scaling ablation for Newton--Muon on short track Record #4. Bar plot over ridge scaling $\gamma$ with $k=32$, $\beta=0.95$, and learning rate $0.0040$ fixed. Right: Learning-rate ablation for Newton--Muon on short track Record #4. Bar plot over learning rate with $k=32$, $\beta=0.95$, and $\gamma=0.2$ fixed.
  • ...and 5 more figures
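For reference, the Newton--Schulz orthogonalization mentioned in Figure 1 is the iteration Muon uses to approximate $\mathrm{msgn}(G)$ without an SVD. Below is a minimal PyTorch sketch following the quintic iteration and tuned coefficients popularized by the public Modded-NanoGPT Muon code; treat the exact constants and step count as conventions of that implementation rather than of this paper.

```python
import torch

@torch.no_grad()
def newton_schulz_msgn(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate msgn(G) = U V^T via a quintic Newton-Schulz iteration.

    Coefficients (a, b, c) are the tuned values from the public
    Modded-NanoGPT Muon code, which also defaults to 5 steps.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)        # scale so all singular values are <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:                  # iterate on the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial in X X^T
    return X.T if transposed else X
```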

Theorems & Definitions (12)

  • Proposition 1
  • Proof of Proposition \ref{prop:exact-newton-polar}
  • Theorem 1
  • Proposition 2
  • Proof of Proposition \ref{prop:Newton--Muon-descent}
  • Corollary 1: Isotropic activations recover Muon (a worked one-line reduction follows this list)
  • Lemma 1: Dynamics under the single spike model
  • Proof of Lemma \ref{lem:ss-mode-wise}
  • Corollary 2: Convergence rates
  • Proof of Corollary \ref{cor:ss-rates}
  • ...and 2 more
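To spell out the reduction behind Corollary 1 ("Isotropic activations recover Muon"): an isotropic input second moment collapses the right preconditioner to a positive scalar, and $\mathrm{msgn}$ is invariant to positive scaling, since scaling changes only the singular values, not the singular vectors. Writing the isotropy as $ZZ^\top = c\,I_n$ (the exact form of the paper's assumption may differ),

$$
ZZ^\top = c\,I_n \ (c>0)
\;\Longrightarrow\;
G(ZZ^\top)^{-1} = c^{-1}G
\;\Longrightarrow\;
\mathrm{msgn}\big(G(ZZ^\top)^{-1}\big) = \mathrm{msgn}(c^{-1}G) = \mathrm{msgn}(G),
$$

because $c^{-1}G = U(c^{-1}S)V^\top$ shares the singular vectors of $G = USV^\top$. Under isotropic activations, the Newton-Muon update $W \leftarrow W - \eta\,\mathrm{msgn}(G(ZZ^\top)^{-1})$ therefore coincides with Muon's $W \leftarrow W - \eta\,\mathrm{msgn}(G)$.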