MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, Ganzhao Yuan

Abstract

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton--Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
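To make the pre-orthogonalization step concrete, here is a minimal PyTorch-style sketch of the default R (row-normalized) variant. The function name row_equilibrate, the EMA decay beta, and the epsilon are illustrative assumptions rather than the paper's exact implementation; the point is that each row of the momentum matrix is rescaled by a running row squared-norm statistic, which costs only $\mathcal{O}(m)$ extra state, before the usual finite-step Newton--Schulz orthogonalization.

import torch

def row_equilibrate(M: torch.Tensor, row_state: torch.Tensor,
                    beta: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    # M: momentum matrix of shape (m, n); row_state: running per-row
    # squared-norm statistics of shape (m,) -- the O(m) auxiliary state.
    row_sq = (M * M).sum(dim=1)
    row_state.mul_(beta).add_(row_sq, alpha=1.0 - beta)
    # Divide each row by its RMS so rows enter Newton--Schulz on a common scale
    # (a zeroth-order whitening surrogate; the C and RC variants apply the
    # analogous column or two-sided rescaling).
    scale = (row_state / M.shape[1]).clamp_min(eps).rsqrt().unsqueeze(1)
    return M * scale

In use, the equilibrated matrix would be handed to the same Newton--Schulz routine Muon already applies, so the change is confined to a cheap rescaling before orthogonalization.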

Paper Structure

This paper contains 32 sections, 15 theorems, 174 equations, 7 figures, and 7 tables.

Key Result

Theorem 3.1

Let $\mathbf G\in\mathbb R^{p\times q}$ have rank $r\ge 1$, with compact SVD $\mathbf G=\mathbf U\Sigma\mathbf V^\top$, and suppose $p\le q$. Fix $\alpha\ge \|\mathbf G\|_2$, set $\mathbf X_0=\alpha^{-1}\mathbf G$, and define $\mathbf X_{k+1}=\left(a\mathbf I_p+b\,\mathbf X_k\mathbf X_k^\top+c\,(\mathbf X_k\mathbf X_k^\top)^2\right)\mathbf X_k$ for $k\ge 0$. Here $(x)_+:=\max\{x,0\}$. See the appendix proof of Theorem 3.1 (proof:th_ns) for the full statement and details.
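For reference, below is a minimal Python/PyTorch sketch of the finite-step Newton--Schulz iteration defined in Theorem 3.1. The theorem is stated for generic coefficients $(a,b,c)$; the values (3.4445, -4.7750, 2.0315) used here are the quintic coefficients commonly seen in Muon implementations and are an assumption of this illustration, as is using the Frobenius norm as a cheap upper bound for $\alpha\ge\|\mathbf G\|_2$.

import torch

def newton_schulz(G: torch.Tensor, steps: int = 5,
                  a: float = 3.4445, b: float = -4.7750, c: float = 2.0315) -> torch.Tensor:
    # Finite-step iteration X_{k+1} = (a I_p + b X_k X_k^T + c (X_k X_k^T)^2) X_k,
    # with X_0 = G / alpha, as in Theorem 3.1 (illustrative coefficients).
    p, q = G.shape
    transposed = p > q          # work with p <= q as the theorem assumes
    X = G.T if transposed else G
    X = X / (X.norm() + 1e-7)   # Frobenius norm >= spectral norm, so alpha >= ||G||_2
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

After a handful of steps the iterate approximates the polar factor $\mathbf U\mathbf V^\top$ of $\mathbf G$, and the quality of that approximation is what the pre-orthogonalization equilibration in MuonEq is designed to improve.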

Figures (7)

  • Figure 1: Random Gaussian matrices with controlled shapes and spectral spreads. Top: finite-step relative Frobenius error to the exact polar factor across Newton--Schulz steps. Bottom: raw and post-normalization condition numbers. Two-sided row/column normalization yields the smallest error and the most consistent spectral compression.
  • Figure 2: Finite-step orthogonalization error across Newton--Schulz steps at 1%, 10%, 50%, and 100% of training. Top: module-wise median; bottom: module-wise mean; shaded bands denote the 25%--75% range. Two-sided row/column normalization decays fastest.
  • Figure 3: The training and validation loss curves, plotted against both training tokens and wall-clock time on LLaMA-2 130M.
  • Figure 4: The training and validation loss curves, plotted against both training tokens and wall-clock time on LLaMA-2 350M.
  • Figure 5: Learning-rate sweeps for LLaMA2-130M and LLaMA2-350M trained on C4 for 2.6B and 7.5B tokens, respectively.
  • ...and 2 more figures

Theorems & Definitions (39)

  • Theorem 3.1
  • Remark 3.1
  • Proposition 3.2
  • Corollary 3.3
  • Remark 3.2
  • Proposition 3.4
  • Theorem 3.5
  • Remark 3.3
  • Remark 3.4
  • Remark 3.5
  • ...and 29 more