A Note on the Convergence of Muon

Jiaxiang Li, Mingyi Hong

TL;DR

This note analyzes the convergence of the Muon optimizer, an SVD-based, momentum-style stochastic optimizer for objectives over matrix-valued variables. It develops a Frobenius-norm descent lemma and explicit convergence rates under standard smoothness and noise assumptions, detailing how the step size, momentum, and batch size shape performance. It further examines a related heavy-ball/minibatch scheme with a spectral-norm objective, providing a bound that depends on the initial optimality gap and the noise variance, and discusses fundamental limitations on batch-free convergence. Together, the results clarify the parameter-tuning trade-offs for Muon during large-model pretraining and contribute to the theoretical understanding of SVD-based updates in stochastic settings.

Abstract

In this note, we examine the convergence of a new optimizer for pretraining LLMs, namely the Muon optimizer. This optimizer is closely related to a specialized steepest descent method in which the update direction is the minimizer of the quadratic approximation of the objective function under the spectral norm. We provide a convergence analysis for both versions of the optimizer and discuss its implications.
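
To make the steepest-descent connection concrete, here is a short worked derivation, offered as a sketch rather than as the authors' exact statement. The symbols $G$ (a gradient or momentum matrix), $D$ (the update direction), $L$ (a smoothness constant), and the reduced SVD $G = U \Sigma V^\top$ are notation assumed here for illustration.

$$
\begin{aligned}
&D^\star \in \arg\min_{D} \; \langle G, D \rangle + \frac{L}{2}\,\|D\|_{\mathrm{op}}^2 , \\
&\text{for fixed } \|D\|_{\mathrm{op}} = r:\quad \min_{\|D\|_{\mathrm{op}} = r} \langle G, D \rangle = -r\,\|G\|_{*}, \quad \text{attained at } D = -r\,U V^\top , \\
&\min_{r \ge 0} \; \Big( -r\,\|G\|_{*} + \frac{L}{2}\,r^2 \Big) \;\Longrightarrow\; r^\star = \frac{\|G\|_{*}}{L} , \\
&D^\star = -\frac{\|G\|_{*}}{L}\, U V^\top .
\end{aligned}
$$

Here $\|\cdot\|_{\mathrm{op}}$ is the spectral norm and $\|\cdot\|_{*}$ is the nuclear norm, its dual. Under these assumptions the direction $-U V^\top$ keeps the singular vectors of $G$ and discards its singular values. An update of the form $X_{t+1} = X_t - \eta_t U_t V_t^\top$, with a separately chosen step size $\eta_t$, is the kind of SVD-based step the abstract relates to this steepest descent method.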

Paper Structure

This paper contains 3 sections, 3 theorems, 51 equations.

Key Result

Lemma 2.1

For update eq:muon2, we have that [...]. If we take $\eta_t = \eta$, a constant, we have [...].

Theorems & Definitions (5)

  • Lemma 2.1
  • Theorem 2.1
  • Theorem 3.1
  • Remark 3.1
  • Remark 3.2