NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

Hadi Mohaghegh Dolatabadi; Thalaiyasingam Ajanthan; Sameera Ramasinghe; Chamin P Hewa Koneputugodage; Shamane Siriwardhana; Violetta Shevchenko; Karol Pajak; James Snewin; Gil Avraham; Alexander Long

TL;DR

NuMuon augments Muon with a nuclear-norm constraint on the update direction, which further constrains the learned weights toward low-rank structure. It increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.

Abstract

The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not been characterized yet. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.

Paper Structure (47 sections, 11 theorems, 66 equations, 22 figures, 18 tables, 3 algorithms)

Introduction
Background
Muon Optimizer.
Linear Minimization Oracles.
LLM Compression via Low-Rank Structure.
Our Method
Motivation
NuMuon: Nuclear-norm-constrained Muon
Convergence Analysis
Practical Considerations
Top-$k$ SVD via Randomized Block Krylov Method.
Rank Scheduler.
Related Work
Low-Rank Structure in Neural Networks.
LLM Compression via Low-Rank Factorization.
...and 32 more sections

Key Result

Proposition 3.0

Let $\boldsymbol{M}\in\mathbb{R}^{d_{\rm out}\times d_{\rm in}}$ with thin SVD $\boldsymbol{M}=\boldsymbol{U}\,\mathrm{diag}(\boldsymbol{\sigma})\,\boldsymbol{V}^\top$, where $\sigma_1\ge \sigma_2\ge \cdots \ge \sigma_q\ge 0$ and $q=\min(d_{\rm out},d_{\rm in})$. Consider the LMO … There exists an optimal solution of the form $\boldsymbol{\Delta W}^\star=-\boldsymbol{U}\,\mathrm{diag}(\boldsymbol{s}^\star)$ …
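
The proposition statement is truncated here, so the exact constraint set is not shown. As a minimal sketch, assume the LMO minimizes $\langle \boldsymbol{M}, \boldsymbol{\Delta W}\rangle$ over the intersection of the unit spectral-norm ball and a nuclear-norm ball of integer radius $k$. Under that assumption, one optimal direction keeps only the top-$k$ singular directions of $\boldsymbol{M}$, each with unit weight. The function name `topk_lmo_direction` and the radius `k` are hypothetical, and a full SVD stands in for the randomized block Krylov top-$k$ SVD listed in the outline.

```python
import numpy as np


def topk_lmo_direction(M: np.ndarray, k: int) -> np.ndarray:
    """Return -U_k V_k^T, an LMO solution under the ASSUMED constraint set
    {spectral norm <= 1} intersected with {nuclear norm <= k}.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Thin SVD: M = U diag(s) Vt with singular values in descending order.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = min(k, s.size)
    # Unit weight on the k leading singular directions, zero elsewhere.
    return -U[:, :k] @ Vt[:k, :]


rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))   # stand-in for a momentum/gradient matrix
delta = topk_lmo_direction(M, k=2)

sigma = np.linalg.svd(M, compute_uv=False)
delta_sv = np.linalg.svd(delta, compute_uv=False)
inner = float(np.sum(M * delta))  # <M, delta> = -(sigma_1 + sigma_2)
```

With `k` equal to $q$, the same expression reduces to the full orthogonalized direction $-\boldsymbol{U}\boldsymbol{V}^\top$, which is the full-rank update the abstract attributes to Muon.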

Figures (22)

Figure 1: Normalized stable rank evolution for Qwen3-0.6B across training steps for feedforward projection matrices. Each subplot shows the mean stable rank (normalized by the maximum rank), with shaded regions indicating standard deviation across all layers. All other weight matrices exhibit a similar low-rank behavior throughout training (see \ref{fig:stable_rank_comparison_qwen3_all} in the Appendix).
Figure 2: Validation perplexity on WikiText2 against generation inference throughput for Llama3-1.8B models compressed via SVD-LLM \cite{wang2025svdllm}. For a given perplexity, NuMuon-trained models provide the fastest inference at moderate to extreme compression rates (40-80%). Results for other models can be found in \ref{fig:svdllm_efficiency}.
Figure 3: $\delta_1^{(\mathrm{F})}$ as a proxy for the tail bound in \ref{ass:tail_control} for the feedforward projection matrices. This quantity is bounded and close to zero for NuMuon, supporting this assumption. For other parameters, see \ref{fig:delta_1}.
Figure 4: Training loss convergence for language models with 0.6B-1.8B parameters. For each model family, we use AdamW, Muon, and NuMuon to train the model. For more details, see \ref{app:extended_results:details}.
Figure 5: NuMuon's relative performance against Muon under SoTA LLM compression methods. The scatter plot displays the performance improvement on downstream tasks as well as validation perplexity improvements for all compression methods at all rates. The positive quadrant shows performance improvement.
...and 17 more figures
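
The normalized stable rank plotted in Figure 1 can be illustrated with a short sketch. It assumes the common definition of stable rank, $\|\boldsymbol{W}\|_F^2 / \|\boldsymbol{W}\|_2^2$, divided by the maximum possible rank $\min(d_{\rm out}, d_{\rm in})$; the caption does not spell out the formula, so the definition and the function name `normalized_stable_rank` are assumptions.

```python
import numpy as np


def normalized_stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2, divided by min(W.shape).

    ASSUMED definition for illustration; values lie in (0, 1].
    """
    fro_sq = np.linalg.norm(W, "fro") ** 2   # sum of squared singular values
    spec_sq = np.linalg.norm(W, 2) ** 2      # largest singular value, squared
    return float(fro_sq / spec_sq) / min(W.shape)


# An identity matrix has a flat spectrum, so the value is 1.
flat = normalized_stable_rank(np.eye(4))

# A rank-one 4x3 matrix has stable rank 1, so the value is 1/3.
rank_one = normalized_stable_rank(np.outer(np.arange(1.0, 5.0), np.ones(3)))
```

Values near 1 indicate a flat spectrum, while values near 0 indicate that a few singular directions carry most of the energy, which is the regime that low-rank compression exploits.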

Theorems & Definitions (22)

Proposition 3.0
proof : Proof sketch
Proposition 3.0
proof : Proof sketch
Theorem 3.1: Nonconvex Nuclear-Norm Stationarity
proof : Proof Sketch
Proposition A.0
proof
Proposition A.0
proof
...and 12 more
