On the Convergence of Muon and Beyond

Da Chang, Yongxiang Liu, Ganzhao Yuan

TL;DR

This work addresses the theoretical gap in Muon’s convergence for stochastic non-convex optimization by introducing two variance-reduced variants, Muon-MVR1 and Muon-MVR2. It proves that Muon-MVR2 achieves the optimal iteration complexity $\tilde{\mathcal{O}}(T^{-1/3})$ in general non-convex settings, and shows last-iterate convergence under the PL condition with rates $\tilde{\mathcal{O}}(T^{-2/3})$ for Muon-MVR2 and $\tilde{\mathcal{O}}(T^{-1/2})$ for Muon-MVR1. Under the PL framework, the results provide concrete nonergodic guarantees, complementing ergodic analyses. Experiments on CIFAR-10 and C4 corroborate the theoretical findings, demonstrating accelerated per-iteration convergence and validating Muon-MVR2 as a practically effective, theoretically optimal variant for large-scale training.

Abstract

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal iteration complexity of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we analyze two Momentum-based Variance-Reduced variants: a one-batch version (Muon-MVR1) and a two-batch version (Muon-MVR2). We provide the first rigorous proof that incorporating variance reduction enables Muon-MVR2 to attain the optimal iteration complexity of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Furthermore, our analysis establishes last-iterate convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work offers the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.
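
To make the gap between the two rates concrete, the short calculation below compares how many iterations each rate needs to reach a target accuracy. It is a sketch under simplifying assumptions that are not taken from the paper: the bound is idealized as exactly $T^{-1/q}$ (constant set to 1, logarithmic factors dropped), and the helper name `iterations_needed` is hypothetical.

```python
def iterations_needed(inv_eps: int, q: int) -> int:
    """Smallest T with T**(-1/q) <= 1/inv_eps, which is T = inv_eps**q.

    Idealized bound: constant 1 and no logarithmic factors (assumption).
    """
    return inv_eps ** q


for inv_eps in (10, 100):
    t_quarter = iterations_needed(inv_eps, 4)  # rate T^(-1/4)
    t_third = iterations_needed(inv_eps, 3)    # rate T^(-1/3)
    print(
        f"eps = 1/{inv_eps}: T^(-1/4) needs {t_quarter} iterations, "
        f"T^(-1/3) needs {t_third}, ratio {t_quarter // t_third}"
    )
```

Under these assumptions the saving factor equals $1/\epsilon$, so the advantage of the $T^{-1/3}$ rate grows as the target accuracy tightens.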

Paper Structure

This paper contains 30 sections, 12 theorems, 119 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose Assumptions 1, 2, and 3 hold. Consider Algorithm 1 with a learning rate of $\eta_t = t^{-3/4}$. The expected average squared norm of the gradient is then bounded, with a separate bound for each option covered by the theorem. See the appendix proof (label `proof:th1_vr1`) for details.
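
As a rough illustration of how a variance-reduced, orthogonalized update could be wired together with the step size $\eta_t = t^{-3/4}$, the toy loop below may help. It is not the paper's Algorithm 1, which is not reproduced here. Everything in it is an assumption made for illustration: the generic STORM-style estimator $m_t = g(X_t;\xi_t) + (1-\beta)\,(m_{t-1} - g(X_{t-1};\xi_t))$ with both gradients evaluated on the same sample, a plain cubic Newton-Schulz iteration in place of the tuned polynomial that Muon implementations typically use, the quadratic toy objective, the value $\beta = 0.5$, and all function names.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combine(alpha, a, beta, b):
    """Return alpha * a + beta * b elementwise."""
    return [[alpha * x + beta * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def orthogonalize(m, steps=10):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X (assumed variant)."""
    norm = math.sqrt(sum(v * v for row in m for v in row)) or 1.0
    x = [[v / norm for v in row] for row in m]  # singular values are now <= 1
    for _ in range(steps):
        xt = [list(col) for col in zip(*x)]
        x = combine(1.5, x, -0.5, matmul(matmul(x, xt), x))
    return x


def mvr_estimate(g_new, g_old, m_prev, beta):
    """STORM-style estimator: g_new + (1 - beta) * (m_prev - g_old)."""
    correction = combine(1.0, m_prev, -1.0, g_old)
    return combine(1.0, g_new, 1.0 - beta, correction)


def loss(x, target):
    """Toy objective 0.5 * ||X - target||_F^2 (hypothetical)."""
    return 0.5 * sum((a - b) ** 2 for rx, rt in zip(x, target) for a, b in zip(rx, rt))


def noisy_grad(x, target, noise):
    """Gradient of the toy loss plus a sampled noise matrix."""
    return [[a - b + n for a, b, n in zip(rx, rt, rn)]
            for rx, rt, rn in zip(x, target, noise)]


def run(num_steps=100, beta=0.5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    target = [[3.0, 0.0], [0.0, -2.0]]

    def sample():
        return [[rng.gauss(0.0, sigma) for _ in range(2)] for _ in range(2)]

    x_prev = [[0.0, 0.0], [0.0, 0.0]]
    m = noisy_grad(x_prev, target, sample())
    x = combine(1.0, x_prev, -1.0, orthogonalize(m))  # first step, eta_1 = 1
    for t in range(2, num_steps + 1):
        noise = sample()  # one sample, evaluated at both iterates
        g_new = noisy_grad(x, target, noise)
        g_old = noisy_grad(x_prev, target, noise)
        m = mvr_estimate(g_new, g_old, m, beta)
        eta = t ** -0.75  # schedule eta_t = t^(-3/4)
        x_prev, x = x, combine(1.0, x, -eta, orthogonalize(m))
    return loss([[0.0, 0.0], [0.0, 0.0]], target), loss(x, target)


initial_loss, final_loss = run()
print(f"initial loss {initial_loss:.3f}, final loss {final_loss:.3f}")
```

Evaluating the same sample at both $X_t$ and $X_{t-1}$ is what lets the correction term cancel shared noise; in this toy problem the noise is additive, so the cancellation is exact, which will not hold for real training losses.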

Figures (2)

  • Figure 1: Training dynamics of Muon-MVR2, Muon-MVR1, Muon-MVR1 ($\gamma=0$), and a baseline on CIFAR-10 with ResNet-18. The plots show (a) accuracy and (b) loss versus epochs for both training and testing, along with (c) test accuracy versus wall-clock time.
  • Figure 2: Training and validation curves for LLaMA2-130M on the C4 dataset.

Theorems & Definitions (30)

  • Theorem 3.1
  • Remark 3.1
  • Theorem 3.2
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Theorem 3.3
  • Theorem 3.4
  • Remark 3.5
  • Remark 4.1
  • ...and 20 more