Convergence Bound and Critical Batch Size of Muon Optimizer

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

TL;DR

This work addresses the theoretical understanding of Muon, a matrix-structured optimizer that orthogonalizes momentum to exploit parameter geometry. It provides convergence guarantees for four practical configurations (with/without Nesterov momentum and with/without weight decay) and derives the critical batch size to minimize training cost, supported by experiments on vision and language modeling tasks. Key findings show that weight decay tightens bounds and that a learning-rate condition $\eta \le 1/\lambda$ promotes stability, with the critical batch size depending on momentum and decay parameters. Collectively, the results offer both theoretical insight and practical guidance for deploying Muon in large-scale settings, highlighting its potential as a competitive alternative to AdamW and other baselines.

Abstract

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents a theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that adding weight decay yields strictly tighter theoretical bounds, and we clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling tasks.
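To make the four configurations concrete, the following is a minimal pure-Python sketch of one Muon step on a single weight matrix. Several parts are assumptions rather than the paper's exact algorithm: the momentum form `M = beta * M + (1 - beta) * G`, the Nesterov combination, and the update `W - eta * (O + lam * W)` follow a commonly used formulation, and the orthogonalization uses a plain cubic Newton-Schulz iteration instead of the tuned polynomial found in practical implementations. The helper names `matmul`, `transpose`, `orthogonalize`, and `muon_step` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(c, iters=30):
    """Approximate the orthogonal factor U V^T of c = U S V^T.

    Assumption: cubic Newton-Schulz, X <- 1.5 X - 0.5 X X^T X, started from
    c scaled by its Frobenius norm so every singular value is at most 1.
    """
    norm = math.sqrt(sum(v * v for row in c for v in row)) or 1.0
    x = [[v / norm for v in row] for row in c]
    for _ in range(iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


def muon_step(w, m, g, eta, beta, lam=0.0, nesterov=True):
    """One sketched Muon update; returns (new weights, new momentum).

    lam=0.0 disables weight decay and nesterov=False disables Nesterov
    momentum, which gives the four configurations discussed in the text.
    """
    # Momentum as an exponential moving average of gradients (assumed form).
    m_new = [[beta * mi + (1 - beta) * gi for mi, gi in zip(rm, rg)]
             for rm, rg in zip(m, g)]
    if nesterov:
        # Nesterov look-ahead: mix the fresh momentum with the gradient again.
        c = [[beta * mi + (1 - beta) * gi for mi, gi in zip(rm, rg)]
             for rm, rg in zip(m_new, g)]
    else:
        c = m_new
    o = orthogonalize(c)
    # Decoupled weight decay: W <- W - eta * (O + lam * W) = (1 - eta*lam) W - eta O.
    w_new = [[wi - eta * (oi + lam * wi) for wi, oi in zip(rw, ro)]
             for rw, ro in zip(w, o)]
    return w_new, m_new
```

Under this assumed update, the factor multiplying the old weights is `1 - eta * lam`, which stays in `[0, 1]` exactly when `eta <= 1 / lam`. At the boundary `eta = 1 / lam` the old weights drop out entirely and the new weights equal `-O / lam`.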

Paper Structure

This paper contains 43 sections, 14 theorems, 72 equations, 16 figures, 7 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose Assumptions assum:01 and assum:02 hold. Then, for all $t \in \mathbb{N}$, (i) for Muon w/o Nesterov and w/o Weight Decay, (ii) for Muon w/ Nesterov and w/o Weight Decay, where $\Delta := \| M_0 - \nabla f(W_0) \|_{\mathrm{F}}^2$, $\bar{\beta} := \frac{(2\beta+1)(1-\beta)}{2}$, $\nu := \sqrt{2(1-\beta)n}$, and $\gamma := \frac{L\eta}{1-\beta}$.

Figures (16)

  • Figure 1: Empirical validation of the stability condition in Proposition 3.2. Final gradient norm (left) and training loss (right) for ResNet-18 on CIFAR-10 with Muon at $\lambda = 0.0625$. The dashed line shows $\eta = 1/\lambda$. Training is most stable near this value.
  • Figure 2: Convergence rate comparison for ResNet-18 on CIFAR-10 with batch size 2048. Training loss (left) and smoothed gradient norm (right) over steps. Muon with Nesterov and weight decay converges fastest, consistent with the bounds in the paper's convergence-rate table.
  • Figure 3: Batch-size scaling and SFO on ResNet-18/CIFAR-10. (Left) Steps to reach 90% test accuracy. (Right) SFO to reach 95% training accuracy. Muon achieves the best efficiency across batch sizes; Nesterov shifts the critical batch size to the right.
  • Figure 4: Dependence of SFO and critical batch size on $\beta$ for ResNet-18/CIFAR-10. The critical batch size consistently decreases as $\beta$ increases, in line with the paper's critical batch size analysis.
  • Figure 5: Batch-size scaling on C4 with Llama3.1 (160M). Steps to reach the target training loss (left) and SFO complexity (right) versus batch size. Muon outperforms AdamW in terms of both the number of steps required to reach the target loss and the SFO complexity in almost all cases. Nesterov momentum and weight decay provide little additional benefit for this workload.
  • ...and 11 more figures
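
The batch-size scaling figures above report two quantities: steps to reach a target, and SFO complexity. A minimal sketch of how an empirical critical batch size could be read off such measurements follows. It assumes SFO complexity is counted as the number of steps times the batch size. The measurements are made-up illustrative numbers, not results from the paper, and the names `sfo_complexity` and `critical_batch_size` are hypothetical.

```python
def sfo_complexity(batch_size, steps):
    """Stochastic gradient evaluations used: one per sample per step (assumed)."""
    return batch_size * steps


def critical_batch_size(steps_to_target):
    """Return the measured batch size with the lowest SFO complexity.

    steps_to_target maps batch size -> steps needed to reach a fixed target.
    """
    return min(steps_to_target, key=lambda b: sfo_complexity(b, steps_to_target[b]))


# Hypothetical measurements (batch size -> steps), invented for illustration only.
measurements = {64: 10000, 128: 4500, 256: 2100, 512: 1200, 1024: 800, 2048: 600}

for b in sorted(measurements):
    print(b, measurements[b], sfo_complexity(b, measurements[b]))
print("empirical critical batch size:", critical_batch_size(measurements))
```

In this toy data, steps keep falling as the batch size grows, but beyond a point the reduction no longer offsets the larger batch, so the SFO curve has an interior minimum. That minimum is what the critical batch size refers to.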

Theorems & Definitions (14)

  • Theorem 3.1
  • Proposition 3.1
  • Proposition 3.2
  • Corollary 3.1
  • Theorem 3.2
  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Lemma 1.1
  • Lemma 1.2
  • ...and 4 more