Dion2: A Simple Method to Shrink Matrix in Muon
Kwangjun Ahn, Noah Amsel, John Langford
TL;DR
The paper tackles the scalability bottleneck of Muon's orthonormalization by introducing Dion2, a straightforward method that shrinks the matrix entering Newton–Schulz iterations by orthonormalizing only a fraction of rows or columns. It couples this submatrix approach with a selective decay (error-feedback) mechanism to preserve update quality. Empirical results at 300M and 1B parameters show that random submatrix selection can match or closely approach Muon's performance while reducing compute and communication costs, especially in data-parallel settings. The work highlights the surprising efficacy of sparse updates and calls for larger-scale validation to fully quantify practical benefits.
Abstract
The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a method for shrinking the matrix involved in Muon's computation that is much simpler than prior approaches. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs, which in turn improves the scalability of Muon.
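To make the row-selection idea concrete, the following is a minimal NumPy sketch of one step, written under several assumptions that the summary above does not confirm. The function names (`newton_schulz`, `dion2_like_step`), the hyperparameters (`frac`, `lr`, `decay`), the momentum accumulation rule, and the exact form of the selective decay are all hypothetical. The orthonormalization uses the classic cubic Newton-Schulz iteration rather than whatever tuned polynomial Muon implementations use, and rows are chosen uniformly at random.

```python
import numpy as np


def newton_schulz(A, steps=25):
    """Approximate the orthogonal polar factor of A (classic cubic iteration)."""
    # Frobenius scaling puts every singular value in (0, 1], where the
    # iteration X <- 1.5 X - 0.5 X X^T X pushes them toward 1.
    X = A / (np.linalg.norm(A) + 1e-12)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # work on the wide orientation so X @ X.T is the small Gram matrix
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X


def dion2_like_step(W, M, G, frac=0.25, lr=0.02, decay=0.95, rng=None):
    """One illustrative update: orthonormalize only a random subset of rows.

    Assumed, not taken from the paper: momentum is accumulated as M + G,
    and only the rows that were just used are decayed (the "selective
    decay" / error-feedback idea), so unselected rows keep their momentum.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = M + G                                   # new array; caller's M is untouched
    k = max(1, int(round(frac * M.shape[0])))   # number of rows to select
    rows = rng.choice(M.shape[0], size=k, replace=False)
    O = newton_schulz(M[rows])                  # only a k x n matrix is orthonormalized
    W = W.copy()
    W[rows] -= lr * O                           # sparse update: other rows unchanged
    M[rows] *= decay                            # decay only the rows that were used
    return W, M, rows
```

In this sketch the matrix entering Newton-Schulz has `frac` times as many rows as the full momentum matrix, which is where the compute saving would come from; rows that are not selected keep accumulating gradient until they are eventually sampled.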
