Muon Outperforms Adam in Tail-End Associative Memory Learning

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan

TL;DR

The paper analyzes why Muon outperforms Adam in training transformers by focusing on associative-memory components in VO and FFN. Through empirical ablations, spectral analyses, and a one-layer theoretical model, it shows Muon yields more isotropic weight spectra and balanced learning across head and tail classes, particularly on heavy-tailed data. The findings demonstrate that Muon’s matrix-norm optimization aligns with outer-product memory structures, improving tail knowledge and overall stability, and that VO+FFN is the key driver of gains. The results scale to larger models and offer a principled explanation of Muon’s advantages with practical implications for memory-augmented transformer optimization.

Abstract

The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
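The claim that Muon's update aligns with the outer-product structure of a linear associative memory can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not the paper's exact setup: a squared loss instead of the paper's objective, orthonormal key and value embeddings, a power-law class frequency `p`, a zero-initialized weight matrix, and an exact SVD standing in for Muon's approximate orthogonalization (momentum is omitted). The names `orthogonalize` and `one_step_recall` are hypothetical. Adam is not modeled here.

```python
import numpy as np

def orthogonalize(G, tol=1e-12):
    """Replace the nonzero singular values of G by 1 (exact SVD stand-in
    for Muon's approximate orthogonalization step)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s > tol * s.max()          # drop numerically zero directions
    return U[:, keep] @ Vt[keep]

def one_step_recall(K=8, d=16, lr=1.0, seed=0):
    """One update from W = 0 on L(W) = 0.5 * sum_i p_i ||W e_i - u_i||^2.

    Returns class frequencies p and the per-class recall u_i^T W e_i
    after one GD step and after one orthogonalized (Muon-style) step.
    """
    rng = np.random.default_rng(seed)
    # Assumption: orthonormal key (E) and value (U) embeddings as columns.
    E = np.linalg.qr(rng.standard_normal((d, K)))[0]
    U = np.linalg.qr(rng.standard_normal((d, K)))[0]
    # Assumption: heavy-tailed (power-law) class frequencies.
    p = 1.0 / np.arange(1, K + 1)
    p /= p.sum()
    # Gradient at W = 0 is a frequency-weighted sum of outer products.
    G = -(U * p) @ E.T                # = -sum_i p_i u_i e_i^T
    W_gd = -lr * G
    W_muon = -lr * orthogonalize(G)
    recall = lambda W: np.einsum("dk,dk->k", U, W @ E)
    return p, recall(W_gd), recall(W_muon)

p, gd_recall, muon_recall = one_step_recall()
print("class freq :", np.round(p, 3))
print("GD recall  :", np.round(gd_recall, 3))
print("Muon recall:", np.round(muon_recall, 3))
```

Under these assumptions the singular values of the gradient are exactly the class frequencies, so the GD step stores each class in proportion to how often it appears, while the orthogonalized step stores every class with equal strength. This is only a toy view of the balanced-learning property described in the abstract.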

Paper Structure

This paper contains 27 sections, 5 theorems, 81 equations, 10 figures, 8 tables.

Key Result

Theorem 5.3

If Assumptions assump:ortho and assump:two_class hold, with fixed $\alpha,\beta$ such that $\alpha\neq\beta$, and $K$ goes to infinity, we obtain the following results for one-step GD, Muon, and Adam.

Figures (10)

  • Figure 1: Validation loss comparison on the 160M NanoGPT model with ungated and gated FFN. Panels (a) and (b) show the "Independent Blocks" results, where individual components are optimized separately, for models with ungated and gated FFN, respectively. Panels (c) and (d) show the "Combined Configurations" results, where multiple components are optimized jointly, again for ungated and gated FFN models.
  • Figure 2: Spectral dynamics of transformer weight matrices during training. Each panel reports four metrics characterizing singular value distributions: SVD entropy, Top10E, eRank, and Q75/Q25 ratio. The four subplots correspond to different weight matrix groups: (a) VO, (b) VO (gated FFN), (c) $W_{\text{out}}$, and (d) $W_{\text{out}}$ (gated FFN).
  • Figure 3: Performance comparison of different optimizers for transformers with non-gated FFN on a heavy-tailed knowledge task. (a) Sample distribution per class, following a power law. (b–d) Performance of Muon, Adam, and SGD+Momentum. (e) Muon applied to VO and FFN, with Adam on QK. (f) Muon applied to QK, with Adam on VO and FFN.
  • Figure 4: (a) Average angles between $E_{i}$ or $\widetilde{E}_{i}$ in FFN at layers $5$, $10$, $15$, $20$, $25$ of Llama3-8b-instruct. (b) Results of one-step GD, SignGD, and Muon with both coupled and decoupled embeddings. For GD, the outcomes under the two embedding types coincide. (c) Results of multi-step GD, SignGD, and Muon with both coupled and decoupled embeddings.
  • Figure 5: Validation loss comparison on the 0.7B NanoGPT model. (a) Combined configuration with non-gated feed-forward networks. (b) Combined configuration with gated feed-forward networks.
  • ...and 5 more figures
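
Figure 2 names four spectral metrics without defining them on this page. The sketch below shows one plausible way to compute them from a weight matrix. The formulas are common definitions assumed here and may differ from the paper's: normalized entropy of the squared singular values for SVD entropy, the share of squared singular values in the top 10 for Top10E, the exponential of the entropy of the L1-normalized singular values for eRank, and the ratio of the 75th to the 25th percentile of the singular values for Q75/Q25. The function name `spectral_metrics` is hypothetical.

```python
import numpy as np

def spectral_metrics(W, top_k=10):
    """Summaries of how evenly W spreads its energy over singular directions.

    All four definitions are assumptions; see the note above the code.
    """
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    n = s.size

    # SVD entropy: entropy of the energy distribution, scaled to [0, 1].
    energy = s**2 / np.sum(s**2)
    nz = energy[energy > 0]
    svd_entropy = float(-np.sum(nz * np.log(nz)) / np.log(n))

    # Top10E: fraction of total energy held by the top_k singular values.
    top_energy = float(np.sum(energy[:top_k]))

    # eRank: exp of the entropy of the L1-normalized singular values.
    q = s / np.sum(s)
    qnz = q[q > 0]
    erank = float(np.exp(-np.sum(qnz * np.log(qnz))))

    # Q75/Q25: spread between upper and lower quartile singular values
    # (undefined when the lower quartile is zero).
    q75, q25 = np.quantile(s, [0.75, 0.25])
    quartile_ratio = float(q75 / q25)

    return {"svd_entropy": svd_entropy, "top10e": top_energy,
            "erank": erank, "q75_q25": quartile_ratio}

# A perfectly isotropic spectrum (identity matrix) as a reference point.
print(spectral_metrics(np.eye(32)))
```

With these definitions, a more isotropic spectrum shows up as SVD entropy closer to 1, eRank closer to the matrix dimension, lower Top10E, and a Q75/Q25 ratio closer to 1.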

Theorems & Definitions (5)

  • Theorem 5.3
  • Theorem 5.4
  • Proposition F.1
  • Proposition F.2
  • Proposition F.3