Effective Quantization of Muon Optimizer States
Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi
TL;DR
The paper tackles the memory burden imposed by FP32 optimizer states in large-scale LLM training and proposes an 8-bit Muon optimizer based on blockwise quantization. It demonstrates that 8-bit Muon, with either linear or dynamic quantization, matches the performance of full-precision Muon on GPT-style pretraining up to 2.7B parameters while delivering substantial memory savings (up to 62%, and up to 75% in extended setups). The authors provide theoretical analyses that show Muon’s robustness to quantization noise, explain AdamW’s instability under linear quantization, and show that SGD with momentum can remain stable under linear quantization. The work offers a practical, scalable default for memory-efficient training and suggests future work that combines it with additional quantization strategies and low-rank momentum techniques.
Abstract
The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency than AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B parameters and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while reducing the optimizer state footprint by up to 62%. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
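To make the phrase "blockwise linear quantization" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a symmetric absmax variant: each block stores one floating-point scale, and values map to evenly spaced signed 8-bit codes. The block size, the names `quantize_blockwise` and `dequantize_blockwise`, and the toy momentum buffer are all hypothetical choices for illustration; the abstract does not specify these details.

```python
import random

BLOCK_SIZE = 4  # illustrative assumption; practical block sizes are much larger
QMAX = 127      # symmetric signed 8-bit code range [-127, 127]


def quantize_blockwise(values, block_size=BLOCK_SIZE):
    """Return (codes, scales): integer codes plus one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 to avoid division by zero.
        scale = absmax / QMAX if absmax > 0 else 1.0
        scales.append(scale)
        # Linear quantization: codes are evenly spaced multiples of the scale.
        codes.extend(int(round(v / scale)) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=BLOCK_SIZE):
    """Reconstruct approximate values from codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Toy stand-in for a flattened momentum buffer (hypothetical data).
random.seed(0)
momentum = [random.gauss(0.0, 1.0) for _ in range(16)]

codes, scales = quantize_blockwise(momentum)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(momentum, restored))
print("per-block scales:", scales)
print("max reconstruction error:", max_error)
```

Because each block has its own scale, a single large entry only coarsens the resolution of its own block, and the per-element rounding error stays within half of that block's scale. In this sketch the state would be stored as the 8-bit codes plus one scale per block, and dequantized before the optimizer update.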
