Muon+: Towards Better Muon via One Additional Normalization Step

Ruijie Zhang; Yequan Zhao; Ziyue Liu; Zhengyang Wang; Zheng Zhang

TL;DR

This work proposes a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization and extends the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$.

Abstract

The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: https://github.com/K1seki221/MuonPlus.
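The core idea in the abstract, orthogonalize the momentum and then apply one more normalization, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `ortho` uses an exact SVD-based polar factor as a stand-in for whatever iterative $\mathrm{Ortho}(\cdot)$ the paper uses, the normalization is assumed to be a per-row or per-column L2 rescaling, and the function names and the reading of "row_col" as "rows first, then columns" are hypothetical. The direction labels themselves are taken from the Figure 3 caption.

```python
import numpy as np

def ortho(m):
    """Exact polar factor via SVD (stand-in for the paper's Ortho(.); an assumption)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def normalize(o, direction, eps=1e-8):
    """Rescale rows and/or columns to unit L2 norm.

    The choice of L2 norm and the order implied by "row_col" / "col_row"
    are assumptions made for this sketch.
    """
    if direction == "none":      # plain Muon: no extra normalization
        return o
    if direction == "row":
        return o / (np.linalg.norm(o, axis=1, keepdims=True) + eps)
    if direction == "col":
        return o / (np.linalg.norm(o, axis=0, keepdims=True) + eps)
    if direction == "row_col":   # rows first, then columns
        return normalize(normalize(o, "row", eps), "col", eps)
    if direction == "col_row":   # columns first, then rows
        return normalize(normalize(o, "col", eps), "row", eps)
    raise ValueError(f"unknown direction: {direction!r}")

def muon_plus_update(momentum, direction="col"):
    """Orthogonalize the momentum matrix, then apply the extra normalization."""
    return normalize(ortho(momentum), direction)

# Usage: a 4x6 momentum matrix, normalized column-wise after orthogonalization.
rng = np.random.default_rng(0)
momentum = rng.standard_normal((4, 6))
update = muon_plus_update(momentum, "col")
```

One observation about this sketch: with an exact polar factor, a full-rank wide matrix already has unit-norm rows, so row normalization changes nothing there. The extra step only has an effect along the other axis, or when the orthogonalization is approximate.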

Paper Structure (26 sections, 5 equations, 4 figures, 15 tables, 1 algorithm)

Introduction
Background
Muon Optimizer
Remark on the role of normalization.
The Muon+ Method
Experiments
Pre-training GPT
Pre-training LLaMA
Overtraining GPT and LLaMA
Overtraining GPT
Overtraining LLaMA
Ablation Study
Performance under Different Learning Rates
Impact of Different Normalization Directions
Ablation for Polar Methods $\mathrm{Ortho}(\cdot)$
...and 11 more sections

Figures (4)

Figure 1: Pre-training GPT and LLaMA models at scales ranging from 130M to 1B parameters under compute-optimal settings. Quantitative results are provided in the Experiments section. Muon+ consistently outperforms Muon across all runs. We also conduct overtraining experiments for both GPT and LLaMA; the results are presented in the Overtraining GPT and LLaMA section.
Figure 2: Training loss curves under overtraining for GPT-Base and LLaMA-350M.
Figure 3: Validation perplexity sweep for LLaMA models under different settings. Here "none (baseline)" is the standard Muon optimizer; "row", "col", "row_col" and "col_row" indicate different normalization directions in Muon+.
Figure 4: Validation perplexity sweep for GPT models under different settings.
