Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

TL;DR

The paper asks why sign-based optimizers such as SignSGD, Lion, Muon, and Muonlight outperform variance-adaptive methods such as AdamW when training large language models under heavy-tailed gradient noise. It introduces a generalized heavy-tailed noise model with tail index $p\in(1,2]$ and gradient-dependent noise magnitude, and proves sharp convergence rates for both vector and matrix sign-descent methods under a generalized smoothness framework. To control the cumulative stochasticity, the authors develop new non-Euclidean martingale concentration inequalities (in the $\ell_1$ norm for vectors and in the nuclear norm for matrices), which give a theoretical justification for the robustness and efficiency of sign-based updates in noisy, high-dimensional settings. Empirically, they validate the noise model on LLM pretraining (GPT-2 scale) and show that sign-based optimizers achieve better training efficiency and stability than NSGD and AdamW, aligning theory with practice and informing the choice of optimizer in heavy-tailed regimes.
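To make the vector sign-descent update concrete, here is a minimal sketch of plain SignSGD on a toy quadratic with heavy-tailed gradient noise. It is an illustration only, not the authors' algorithm or experimental setup: the objective, the Student-t noise, and the names `signsgd_step` and `run_signsgd` are assumptions introduced here, and momentum (as used in Lion) is omitted.

```python
import numpy as np

def signsgd_step(x, grad, lr):
    """One SignSGD update: each coordinate moves by exactly lr against the gradient sign."""
    return x - lr * np.sign(grad)

def run_signsgd(x0, steps=2000, lr=0.01, df=1.5, seed=0):
    """Run SignSGD on the toy objective f(x) = 0.5 * ||x||^2 (assumed for illustration).

    The stochastic gradient is the true gradient x plus Student-t noise.
    For df in (1, 2] the noise has infinite variance, but its absolute
    moments of order below df are finite, which mimics a heavy-tailed regime.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        noisy_grad = x + rng.standard_t(df, size=x.shape)
        x = signsgd_step(x, noisy_grad, lr)
    return x

if __name__ == "__main__":
    print(run_signsgd(np.array([5.0, -5.0])))
```

Because only the sign of each coordinate is used, a single extreme noise sample can move a coordinate by at most `lr`, which is one intuitive reading of why such updates tolerate heavy tails.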

Abstract

While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
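For the matrix case, the analogue of the coordinate-wise sign is the matrix sign: for a gradient with thin SVD $G = U\Sigma V^\top$, the idealized update direction is $UV^\top$. The sketch below computes it with a full SVD purely for illustration. The function names are hypothetical, momentum is left out, and practical Muon implementations typically approximate this orthogonalization with Newton-Schulz iterations rather than an SVD.

```python
import numpy as np

def matrix_sign(G):
    """Return U @ V^T from the thin SVD of G (the orthogonal polar factor)."""
    U, _, Vt = np.linalg.svd(np.asarray(G, dtype=float), full_matrices=False)
    return U @ Vt

def matrix_sign_step(X, G, lr):
    """One idealized matrix sign-descent step (no momentum, illustration only)."""
    return X - lr * matrix_sign(G)

if __name__ == "__main__":
    G = np.array([[2.0, 0.0], [0.0, -3.0]])
    S = matrix_sign(G)
    # The inner product <G, msign(G)> equals the nuclear norm of G.
    print(np.sum(G * S), np.linalg.norm(G, "nuc"))
```

The identity $\langle G, UV^\top\rangle = \Vert G\Vert_*$ is what ties this update to the nuclear norm, in the same way that $\langle g, \operatorname{sign}(g)\rangle = \Vert g\Vert_1$ ties SignSGD to the $\ell_1$ norm.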

Paper Structure (52 sections, 26 theorems, 147 equations, 9 figures, 1 table, 6 algorithms)

Key Result

Theorem 1

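Under the non-convexity, generalized smoothness, unbiasedness, and heavy-tailed noise assumptions, define $\Delta_f:=f(\mathbf{x}_1)-f^*$. With the parameter choices specified in the paper, the SignSGD algorithm ensures the stated convergence bound (the parameter values and the bound itself are not reproduced in this summary).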

Figures (9)

  • Figure 1: Verification of the heavy-tailed noise assumption. x-axis: $|\nabla_j f|^p$, y-axis: $\mathbb{E} \left[\left|\mathbf{g}_j - \nabla_j f\right|^p\right]$.
  • Figure 2: Verification of the matrix heavy-tailed noise assumption. x-axis: $\left\Vert\nabla f\right\Vert_*^p$, y-axis: $\mathbb{E} \left[\left\Vert\mathbf{V}_0\right\Vert_*^{p/2} \left\Vert\mathbf{G} - \nabla f\right\Vert^p_{\left|\mathbf{V}_0\right|_{\textnormal{m}}^{-1}}\right]$.
  • Figure 3: The training loss, validation loss, and accuracy for nanoGPT trained on C4.
  • Figure 4: Noise histograms of nanoGPT on C4 at initialization, sampled from different coordinates. Q-Q plots at the top right of each histogram compare the distribution with a Gaussian (the red diagonal reference line).
  • Figure 5: Verification of the heavy-tailed noise assumption. x-axis: $|\nabla_j f|^p$, y-axis: $\mathbb{E} \left[\left|\mathbf{g}_j - \nabla_j f\right|^p\right]$.
  • ...and 4 more figures
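
The axes in Figures 1 and 5 pair $|\nabla_j f|^p$ with the $p$-th absolute moment of the coordinate-wise noise. One way such a scatter could be estimated from repeated stochastic gradient samples is sketched below. This is an assumed procedure, not the authors' measurement protocol: the sample-mean estimator, the use of a separately computed reference gradient, and the name `noise_moment_scatter` are all hypothetical.

```python
import numpy as np

def noise_moment_scatter(grad_samples, full_grad, p):
    """Per-coordinate points (|grad_j|^p, empirical E|g_j - grad_j|^p).

    grad_samples: array of shape (n_samples, d), stochastic gradients g.
    full_grad:    array of shape (d,), reference gradient (assumed available,
                  for example from a very large batch).
    p:            tail index, taken in (1, 2] in the paper's noise model.
    """
    grad_samples = np.asarray(grad_samples, dtype=float)
    full_grad = np.asarray(full_grad, dtype=float)
    x_vals = np.abs(full_grad) ** p
    # Sample mean over the n_samples axis of the p-th absolute deviation.
    y_vals = np.mean(np.abs(grad_samples - full_grad) ** p, axis=0)
    return x_vals, y_vals

if __name__ == "__main__":
    samples = np.array([[1.0, 2.0], [3.0, 2.0]])
    print(noise_moment_scatter(samples, np.array([2.0, 2.0]), p=1.5))
```

Plotting `y_vals` against `x_vals` across coordinates would then show whether the noise moment grows with the gradient magnitude, which is the gradient-dependent behavior the noise model is meant to capture.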

Theorems & Definitions (44)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2
  • Theorem 3
  • Remark 3
  • Theorem 4
  • Lemma 1
  • Lemma 2
  • Lemma 3: Concentration in $\ell_1$-norm
  • ...and 34 more