Effective Quantization of Muon Optimizer States

Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi

TL;DR

The paper tackles the memory burden imposed by FP32 optimizer states in large-scale LLM training and proposes an 8-bit Muon optimizer using blockwise quantization. It demonstrates that 8-bit Muon, with either linear or dynamic quantization, can match the performance of full-precision Muon on GPT-style pretraining up to 2.7B parameters while delivering substantial memory savings (up to 62%, and up to 75% in extended setups). The authors provide theoretical analyses showing Muon’s robustness to quantization noise, explain AdamW’s instability under linear quantization, and show SGD with momentum can remain stable under linear quantization. The work offers a practical, scalable default for memory-efficient training and suggests future enhancements by combining with additional quantization strategies and low-rank momentum techniques.

Abstract

The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency over AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B in size and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while achieving up to a 62% reduction in optimizer state footprint. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
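To make "blockwise quantization" with a "simple linear quantization scheme" concrete, the following is a minimal NumPy sketch of one common form: symmetric absmax scaling per block, with signed 8-bit codes. The function names, the block size of 256, and the symmetric absmax choice are assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np

def quantize_blockwise_linear(x, block_size=256):
    """Quantize a tensor to int8 codes with one float32 scale per block.

    Assumption: symmetric absmax linear quantization; block_size is a
    hypothetical value chosen for this sketch.
    """
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % block_size           # zero-pad to a whole number of blocks
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)  # avoid 0/0
    # Linear map: [-scale, scale] -> evenly spaced integer codes in [-127, 127]
    codes = np.round(blocks / scales * 127).astype(np.int8)
    return codes, scales

def dequantize_blockwise_linear(codes, scales, shape):
    """Invert the linear map and drop the padding."""
    n = int(np.prod(shape))
    flat = codes.astype(np.float32) / 127.0 * scales
    return flat.ravel()[:n].reshape(shape)

# Usage: round-trip a stand-in momentum buffer and compare storage cost.
rng = np.random.default_rng(0)
momentum = (0.01 * rng.standard_normal((50, 30))).astype(np.float32)
codes, scales = quantize_blockwise_linear(momentum)
restored = dequantize_blockwise_linear(codes, scales, momentum.shape)

fp32_bytes = momentum.nbytes
quant_bytes = codes.nbytes + scales.nbytes   # int8 codes plus per-block scales
max_abs_error = float(np.abs(restored - momentum).max())
```

Because the codes are evenly spaced within each block, the round-trip error of any entry is at most half a quantization step of its block, and the per-block scale keeps one large entry from degrading the resolution of the whole tensor.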

Paper Structure

This paper contains 36 sections, 3 theorems, 25 equations, 5 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathbf{w}^{(1)}$ denote the parameters after one step of Adam as given in Algorithm 1, and let $\tilde{\mathbf{w}}^{(1)}$ denote the parameters after one step of the same algorithm with 8-bit linear quantization applied to the moment estimates (Definition 1), i.e.: Suppose that each entry of gradient $\mathbf{g}^{(1)} \in \mathbb{R}^d$ satisfies $\mathbb{P}\!\left(\tfra
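The comparison set up in Theorem 1 (one Adam step with and without quantized moment estimates) can be illustrated numerically. The sketch below is not the theorem's statement or proof; it assumes a plain Adam update without weight decay, a single quantization block, symmetric absmax linear quantization, and hypothetical hyperparameters and gradient values.

```python
import numpy as np

def linear_quant_roundtrip(x):
    """Quantize to 8-bit linear codes and dequantize (single block).

    Assumption: symmetric absmax scaling with codes in [-127, 127].
    """
    scale = np.abs(x).max()
    if scale == 0:
        return x.copy()
    return np.round(x / scale * 127) / 127 * scale

def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, quantize=False):
    """Return the parameter update of the first Adam step for gradient g."""
    m = (1 - beta1) * g            # first moment, starting from zero
    v = (1 - beta2) * g ** 2       # second moment, starting from zero
    if quantize:
        m = linear_quant_roundtrip(m)
        v = linear_quant_roundtrip(v)
    m_hat = m / (1 - beta1)        # bias correction at step 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

# Hypothetical gradient with entries at 5% of the largest magnitude.
g = np.array([1.0, 0.05, -0.05, 0.5])
exact = adam_first_step(g)
quant = adam_first_step(g, quantize=True)

largest_exact = float(np.abs(exact).max())   # bounded by lr
largest_quant = float(np.abs(quant).max())   # orders of magnitude larger
```

In this example the second moment scales with the squared gradient, so the small entries round to a zero code while their first-moment codes stay nonzero. The quantized update then divides by roughly `eps` alone, which is consistent with the first-step divergence of quantized AdamW reported for Figure 2.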

Figures (5)

  • Figure 1: Validation loss for GPT-2 Medium across 7 variants.
  • Figure 2: Validation top-1 accuracy on ImageNet for ResNet-50. Quantized SGD overlaps with the FP32 baseline. AdamW with FP32 states underperforms slightly, while the quantized AdamW variant diverged at the first step and is not shown.
  • Figure 3: Pretraining training-loss curves for GPT-2 Small, Medium, Large, XL, and XXL with three optimizers: Muon, Muon-8L, and Muon-8L/AdamW-8D. Curves show the mean over 5 random seeds; error bars (seed-to-seed variation) are present but visually negligible.
  • Figure 4: Pretraining validation-loss curves for GPT-2 Small, Medium, Large, XL, and XXL with three optimizers: Muon, Muon-8L, and Muon-8L/AdamW-8D. Curves show the mean over 5 random seeds; error bars (seed-to-seed variation) are present but visually negligible.
  • Figure 5: Pairwise validation-loss differences relative to the Muon baseline during pretraining for GPT-2 Small, Medium, Large, XL, and XXL.

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 1