MuonAll: Muon Variant for Efficient Finetuning of Large Language Models
Saurabh Page, Advait Joshi, S. S. Sonawane
TL;DR
MuonAll extends the Muon optimizer to cover all parameter types by transforming 1D parameters into diagonal matrices and applying Newton-Schulz-based whitening, so that a single optimizer handles every parameter when finetuning LLMs. The paper evaluates MuonAll and Muon against AdamW on three publicly available pretrained models (Qwen2-0.5B, SmolLM2-360M, GPT2-medium), showing parity with AdamW and, on tasks such as MMLU and GSM8K, sometimes superior performance for Muon. It provides open-source distributed implementations and demonstrates that MuonAll is a viable alternative for efficient, stable SFT on small to mid-sized models. The work also points to future directions: extending spectral-norm-based optimizers to a broader choice of norms and to all parameters of a model.
Abstract
The Muon optimizer has demonstrated robust results in pretraining language models, but its performance in finetuning existing publicly available pretrained models has not yet been explored. Currently, Muon is used alongside AdamW, which leaves room for improvement by bringing all parameters under Muon. We introduce MuonAll, which incorporates all parameters into Muon by transforming them into 2D matrices. We conduct extensive finetuning experiments on publicly available language models with up to half a billion parameters. Muon and MuonAll perform on par with AdamW across major benchmarks, highlighting their effectiveness as alternative optimizers. We open-source distributed implementations of Muon and MuonAll at https://github.com/Saurabh750/optimizer
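To make the idea of bringing 1D parameters into Muon concrete, here is a minimal, dependency-free sketch. It assumes the 1D gradient is embedded as a diagonal matrix (as the TL;DR describes) and uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X as a stand-in. The function names (`newton_schulz`, `orthogonalize_1d`), the step count, and the choice of iteration are illustrative assumptions; the released implementation may use different coefficients, normalization, and scaling.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the classical cubic Newton-Schulz
    iteration (an assumed stand-in, not necessarily the paper's variant)."""
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # which lies inside the convergence region of the iteration.
    norm = math.sqrt(sum(x * x for row in G for x in row)) + eps
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X


def orthogonalize_1d(g, steps=5):
    """Hypothetical helper: embed a 1D gradient as diag(g), orthogonalize it,
    and read the diagonal back out as the 1D update direction."""
    n = len(g)
    D = [[g[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    O = newton_schulz(D, steps)
    return [O[i][i] for i in range(n)]


if __name__ == "__main__":
    # Example 1D gradient, e.g. for a bias or normalization weight.
    print(orthogonalize_1d([3.0, -4.0], steps=8))
```

In this sketch the diagonal embedding makes the iteration act on each entry independently, so with enough steps every nonzero entry is pushed toward its sign (+1 or -1), while small entries converge more slowly. How MuonAll scales or otherwise post-processes this direction is not specified in the text above.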
