HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

TL;DR

The authors argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes training along noise-dominated directions. They propose HTMuon, which preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra.

Abstract

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
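
The Schatten-$q$ statement in the abstract can be made concrete with a short worked derivation. This is a sketch based on standard norm duality and the von Neumann trace inequality, not a reproduction of the paper's proof. The symbols $G$, $\Delta$, $q^*$, and $p$ are notation assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation (not taken from the paper):
%   G = U \Sigma V^\top : compact SVD of the gradient (or momentum) matrix, \sigma_i > 0
%   \|\cdot\|_{S_q}     : Schatten-q norm, i.e. the \ell_q norm of the singular values
%   q^*                 : dual exponent, 1/q + 1/q^* = 1, with 1 < q < \infty
\begin{align*}
\Delta^\star
  &= \arg\max_{\|\Delta\|_{S_q} \le 1} \langle G, \Delta \rangle, \\
\langle G, \Delta \rangle
  &\le \sum_i \sigma_i(G)\,\sigma_i(\Delta)
   \le \|G\|_{S_{q^*}} \, \|\Delta\|_{S_q}
  && \text{(von Neumann, then H\"older)}, \\
\Delta^\star
  &= \frac{U \, \Sigma^{p} \, V^\top}{\|G\|_{S_{q^*}}^{\,q^*-1}},
  \qquad p = q^* - 1 = \frac{1}{q-1}.
\end{align*}
% Check: \|\Delta^\star\|_{S_q} = 1 because (q^*-1)\,q = q^*,
% and \langle G, \Delta^\star \rangle = \|G\|_{S_{q^*}}, so both inequalities are tight.
```

In the limit $q \to \infty$ (the spectral norm), $p \to 0$ and the direction reduces to $U V^\top$, the orthogonalized update with an all-ones spectrum. For finite $q$, $p > 0$ and the update keeps a graded spectrum in which larger singular values of $G$ receive larger steps.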

Paper Structure (53 sections, 9 theorems, 43 equations, 12 figures, 22 tables, 2 algorithms)

This paper contains 53 sections, 9 theorems, 43 equations, 12 figures, 22 tables, 2 algorithms.

Key Result

Lemma 4.1

Suppose the singular values of a matrix ${\bm{W}} \in \mathbb{R}^{n\times m}$ follow $s_k = s_1 k^{-s}$ for $1\leq k\leq m$. Then the power-law (PL) exponent $\alpha$ of ${\bm{W}}$ satisfies $\alpha=1+\frac{1}{2s}$.
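
The relation in Lemma 4.1 can be checked numerically. The sketch below assumes that the PL exponent is measured on the eigenvalues of ${\bm{W}}^\top {\bm{W}}$ (the squared singular values), as is usual in HT-SR analyses, and it uses a Hill-type maximum-likelihood estimator with the smallest eigenvalue as the lower cutoff. The estimator, the helper names, and the values of `s1` and `m` are illustrative choices, not taken from the paper.

```python
import math


def singular_values(s1, s, m):
    """Singular values s_k = s1 * k**(-s) for k = 1..m, as assumed in the lemma."""
    return [s1 * k ** (-s) for k in range(1, m + 1)]


def pl_exponent_mle(eigenvalues):
    """Continuous power-law MLE (Hill-type estimator).

    Uses the smallest eigenvalue as x_min (an assumption made here):
    alpha_hat = 1 + n / sum(log(x_i / x_min)).
    """
    x_min = min(eigenvalues)
    log_ratio_sum = sum(math.log(x / x_min) for x in eigenvalues)
    return 1.0 + len(eigenvalues) / log_ratio_sum


def check(s, m=10_000, s1=3.0):
    """Return (estimated alpha, alpha predicted by the lemma) for decay rate s."""
    # Eigenvalues of W^T W are the squared singular values of W.
    eigs = [sv ** 2 for sv in singular_values(s1, s, m)]
    return pl_exponent_mle(eigs), 1.0 + 1.0 / (2.0 * s)


if __name__ == "__main__":
    for s in (0.25, 0.5, 1.0):
        est, pred = check(s)
        print(f"s={s}: estimated alpha={est:.4f}, lemma alpha={pred:.4f}")
```

The intuition matches the estimate: the eigenvalues decay as $\lambda_k \propto k^{-2s}$, so the fraction of eigenvalues above $\lambda$ scales as $\lambda^{-1/(2s)}$, and the corresponding density scales as $\lambda^{-(1+\frac{1}{2s})}$. A slower decay (smaller $s$) therefore gives a larger $\alpha$, that is, a less heavy-tailed spectrum.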

Figures (12)

  • Figure 1: Muon_NS vs. Muon_SVD on the C4 dataset. (a) Validation perplexity for LLaMA-60M/135M: Muon_NS consistently achieves lower perplexity than Muon_SVD. For both methods, the learning rate is 0.03 for 60M and 0.02 for 135M. (b, c) Spectra of update matrices at steps $1/9000/19000$, shown in different colors: Muon_SVD enforces an exactly "all-ones" spectrum in the update matrices; Muon_NS stays close to one but retains noticeable deviations, implicitly down-weighting noise-dominated singular-vector directions and correlating with improved performance.
  • Figure 2: (a) Average PL exponent $\bar{\alpha}$ of weight ESDs for LLaMA-60M and LLaMA-135M trained on C4 with Muon and COSMOS. Muon yields a higher mean $\bar{\alpha}$, indicating less heavy-tailed spectra than COSMOS. (b) COSMOS outperforms Muon for LLaMA-60M and LLaMA-135M on the C4 dataset.
  • Figure 3: Comparison with Muon variant optimizers on LLaMA-60M and 135M on the C4 dataset. All optimizers are carefully tuned via grid search; detailed results and hyperparameter settings are provided in Tables \ref{table: muon_variants} and \ref{table:hyperparam_C4} in Appendices \ref{app:more_results} and \ref{app:hyperparameter}.
  • Figure 4: Comparison with state-of-the-art pretraining optimizers on LLaMA-60M and 135M on the C4 dataset. All optimizers are carefully tuned via grid search; detailed results and hyperparameter settings are provided in Tables \ref{table: muon_variants} and \ref{table:hyperparam_C4} in Appendices \ref{app:more_results} and \ref{app:hyperparameter}.
  • Figure 5: We evaluate applying HTMuon and HTMuon_NS to LLaMA-60M and LLaMA-135M every 1, 5, 10, and 25 steps, and report the average per-step runtime overhead for all methods. Detailed results and hyperparameter settings are provided in Tables \ref{table:ppl_intervals} and \ref{table:time_intervals} in Appendix \ref{app:more_results}.
  • ...and 7 more figures

Theorems & Definitions (10)

  • Lemma 4.1: Proof in Appendix \ref{app:proof_PL_alpha}
  • Theorem 6.1
  • Definition 6.5
  • Theorem 6.6: HTMuon
  • Lemma A.1: Von Neumann trace inequality
  • Theorem A.2
  • Lemma A.3
  • Theorem A.4: Muon \cite{shen2025convergence}
  • Theorem A.5: HTMuon: Theorem \ref{thm: htmuon_nonconvex}
  • Lemma A.6