Dion2: A Simple Method to Shrink Matrix in Muon

Kwangjun Ahn, Noah Amsel, John Langford

TL;DR

The paper tackles the scalability bottleneck of Muon's orthonormalization by introducing Dion2, a straightforward method that shrinks the matrix entering Newton–Schulz iterations by orthonormalizing only a fraction of rows or columns. It couples this submatrix approach with a selective decay (error-feedback) mechanism to preserve update quality. Empirical results at 300M and 1B parameters show that random submatrix selection can match or closely approach Muon's performance while reducing compute and communication costs, especially in data-parallel settings. The work highlights the surprising efficacy of sparse updates and calls for larger-scale validation to fully quantify practical benefits.

Abstract

The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a much simpler method than prior approaches for shrinking the matrix involved in Muon's computation. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs, which in turn improves the scalability of Muon.
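
To make the procedure in the abstract concrete, here is a minimal NumPy sketch of one update step. It is not the authors' implementation. It assumes row selection (not columns), momentum accumulated as M + G, and the classical cubic Newton–Schulz iteration standing in for whatever polynomial Muon actually uses. The function names (`newton_schulz`, `select_rows`, `dion2_step`) and all hyperparameter values are hypothetical.

```python
import numpy as np

def newton_schulz(A, steps=20):
    """Approximate the orthogonal polar factor of A (classical cubic iteration)."""
    # Frobenius normalization keeps every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = A / (np.linalg.norm(A) + 1e-12)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so the Gram matrix is small
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def select_rows(M, alpha, method="l1", rng=None):
    """Pick a fraction alpha of the rows, by largest l1 norm or uniformly at random."""
    k = max(1, int(round(alpha * M.shape[0])))
    if method == "l1":
        scores = np.abs(M).sum(axis=1)
        return np.sort(np.argsort(-scores)[:k])
    rng = np.random.default_rng() if rng is None else rng
    return np.sort(rng.choice(M.shape[0], size=k, replace=False))

def dion2_step(W, M, G, alpha=0.25, lr=0.02, decay=0.95, method="l1", rng=None):
    """One sketched step: orthonormalize only the selected rows of the momentum."""
    M = M + G                        # assumed momentum accumulation (new array)
    rows = select_rows(M, alpha, method, rng)
    O = newton_schulz(M[rows])       # only the (alpha * m) x n submatrix is processed
    W = W.copy()
    W[rows] -= lr * O                # sparse update: unselected rows are untouched
    M[rows] *= decay                 # selective decay: unselected rows keep their full momentum
    return W, M, rows
```

Because unselected rows are neither updated nor decayed, their gradient signal stays in the momentum buffer and can be picked up in a later step, which is the error-feedback behavior the TL;DR refers to.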

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: We propose a simple method to reduce the size of the matrix entering Muon’s Newton-Schulz iterations while preserving Muon’s high update quality. Left: Shrinking the matrix leads to faster time per step (compute-only benchmark). Right: Even orthonormalizing only 25% of the matrix maintains update quality close to full Muon at the 1B-model / 100B-token training scale (final losses: Muon 2.623 vs. $0.25$-Dion2 2.635).
  • Figure 2: 300M model trained on 20B FineWeb. Left: comparison of Muon and Dion2 implemented with $\ell_1$-norm selection. Right: comparison of different selection methods in Dion2. Final losses are nearly identical. For the $\ell_1$-norm vs. random comparison: ($\alpha = 0.5$) $2.9154$ vs. $2.9148$, ($\alpha = 0.25$) $2.9262$ vs. $2.9296$, ($\alpha = 0.125$) $2.9452$ vs. $2.9469$.
  • Figure 3: Dion vs. Dion2 ($\ell_1$ selection) on the 300M model. Left: zoomed-in view of validation loss. Right: zoomed-out view. Dion2 achieves a better trade-off in update quality. Notably, it initially lags behind Dion but eventually catches up and surpasses it.
  • Figure 4: Error-Feedback Ablation. Dion2 decays only the selected rows or columns of the momentum matrix. We ablate this component by comparing against a variant that decays all rows/columns. The results show that the selective decay mechanism is critical for performance.
  • Figure 5: 1B model trained on 100B FineWeb. Left: comparison of Muon and Dion2. Right: comparison of different selection methods in Dion2.
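
The compute saving shown in Figure 1 (left) can be reasoned about with a rough FLOP count. The sketch below is a back-of-the-envelope model, not the paper's benchmark: it assumes a cubic Newton–Schulz iteration costing two matrix products per step, counts only those products, and assumes rows are the dimension being subsampled. The function names `newton_schulz_flops` and `dion2_speedup` are hypothetical.

```python
def newton_schulz_flops(rows, cols, steps=5):
    """Approximate FLOPs for `steps` cubic Newton-Schulz iterations on a rows x cols matrix."""
    # Each step forms a small x small Gram matrix (2 * small^2 * large FLOPs)
    # and multiplies it back into the matrix (another 2 * small^2 * large FLOPs).
    small, large = min(rows, cols), max(rows, cols)
    return steps * 4 * small**2 * large

def dion2_speedup(m, n, alpha, steps=5):
    """Ratio of full-matrix cost to the cost on a fraction alpha of the m rows."""
    r = max(1, int(round(alpha * m)))
    return newton_schulz_flops(m, n, steps) / newton_schulz_flops(r, n, steps)

print(dion2_speedup(4096, 4096, 0.25))  # square weight matrix
print(dion2_speedup(4096, 1024, 0.25))  # tall weight matrix
```

Under these assumptions the square case gives a ratio of 16.0 and the tall case 4.0, which illustrates why the orthonormalization cost is super-linear in the matrix size: keeping a quarter of the rows can cut the estimated work by more than a factor of four. Measured wall-clock speedups depend on hardware and communication and need not match these ratios.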