Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning

Thibaut Boissin, Thomas Massena, Franck Mamalet, Mathieu Serrurier

TL;DR

The paper addresses the computational bottleneck of orthogonalization in orthogonality-based optimizers like Muon by introducing Almost Orthogonal Preconditioning (AOL). AOL preconditioning accelerates the Newton-Schulz iterations and enables dropping one iteration, yielding up to 2.8x speedups in the NS approximation and 5-10% end-to-end training runtime gains in realistic language and vision tasks, without hyperparameter tuning. The authors provide a drop-in Turbo-Muon implementation with a fused Triton kernel, demonstrate robust performance across NanoGPT and CIFAR-10, and show that the remaining polar error remains manageable even under heavy-tailed gradient regimes. The work broadens the practical applicability of orthogonality-based optimization to medium-scale training by reducing overhead while preserving or improving convergence behavior.

Abstract

Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end training runtime, with a 5-10% improvement in realistic training scenarios across two efficiency-focused tasks. On challenging language and vision tasks, we validate that our method maintains equal or superior model performance while improving runtime. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Our code is publicly available on GitHub.
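
As a rough illustration of the idea in the abstract, the sketch below compares the usual Frobenius normalization with an AOL-style rescaling applied before a Newton-Schulz loop. It is not the authors' implementation. It uses the classic cubic Newton-Schulz update rather than Muon's tuned polynomial, plain NumPy rather than the fused Triton kernel, and it assumes the AOL rescaling takes the form d_j = (sum_i |G^T G|_ji)^(-1/2) from the almost-orthogonal-layer construction. All function names are hypothetical.

```python
import numpy as np

def frobenius_precondition(G, eps=1e-12):
    """Baseline: divide by the Frobenius norm so every singular value is <= 1."""
    return G / (np.linalg.norm(G) + eps)

def aol_precondition(G, eps=1e-12):
    """AOL-style column rescaling (assumed form, see lead-in).

    d_j = (sum_i |G^T G|_ji)^(-1/2); the rescaled matrix has spectral norm <= 1.
    """
    gram = G.T @ G
    d = 1.0 / np.sqrt(np.abs(gram).sum(axis=1) + eps)
    return G * d  # broadcasting scales column j by d[j]

def newton_schulz(X, num_iters):
    """Classic cubic Newton-Schulz step: X <- 1.5 X - 0.5 X X^T X.

    This is a stand-in for Muon's tuned polynomial, not the same iteration.
    """
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def orthogonality_error(X):
    """Frobenius distance between X^T X and the identity (tall X assumed)."""
    n = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(n))

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))  # stand-in for a gradient matrix

# Same iteration budget, two different preconditioners.
err_frob = orthogonality_error(newton_schulz(frobenius_precondition(G), 4))
err_aol = orthogonality_error(newton_schulz(aol_precondition(G), 4))
print(f"Frobenius: {err_frob:.4f}  AOL-style: {err_aol:.4f}")
```

Under these assumptions, the AOL-scaled matrix keeps its spectral norm at or below 1 while starting closer to orthogonal than the Frobenius-normalized one, which is the mechanism that would let an iteration be dropped.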

Paper Structure

This paper contains 33 sections, 2 theorems, 32 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

For any gradient matrix $G \in \mathbb{R}^{m \times n}_{\setminus 0}$ with singular value decomposition $G = U \Sigma V^T$, where $U$ and $V$ are unitary, and any sharpness $\lambda > 0$, consider the problem: […] This problem is solved with a step size $\eta = \frac{1}{\lambda} \operatorname{tr}(\Sigma)$ and an update: […] This solution is unique if and only if $G$ is of full rank. We assume $S$ to be a matrix whose diagonal is filled […]

Figures (11)

  • Figure 1: Practical implementations of orthogonalization face a trade-off between polar error and computation time. Thanks to preconditioning, our method drastically reduces the polar error of the Newton-Schulz algorithm. This improves the tradeoff between convergence and runtime, effectively lowering the overhead of the Muon optimizer.
  • Figure 2: Comparing pre-conditioning methods: AOL consistently outperforms the usual Frobenius normalization. When matrices get larger.
  • Figure 3: Turning preconditioning into runtime: Applying AOL before the algorithm improves its convergence speed (a). This can be used to remove an iteration while achieving a similar polar error. Removing one iteration out of 5 improves the runtime of the algorithm (b), making optimizers like Muon more scalable to large matrices.
  • Figure 4: Turbo-Muon can make realistic training faster without impact on final loss. Table \ref{tab:1b_runtime} shows that it can achieve non-negligible speedups on medium-scale training, with runtime improvements nearing 10% of the total step runtime. Table \ref{tab:nanoGPT_accuracy} shows that our approach does not induce perceptible loss degradation when an iteration is removed.
  • Figure 5: Understanding the nature of the remaining polar error. We decompose the polar error of the Turbo-Muon algorithm as an approximation error that depends on the number of NS iterations, along with a bias error that is introduced by AOL pre-conditioning. Results are measured on 100 random normal matrices.
  • ...and 6 more figures
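
The error decomposition described in the Figure 5 caption can be sketched numerically. The code below is one plausible reading, not the paper's measurement protocol: it assumes the bias term is the distance between the polar factor of the AOL-rescaled matrix and that of the original matrix, and the approximation term is the distance between the Newton-Schulz output and the polar factor of the rescaled matrix. It also uses the classic cubic Newton-Schulz step as a stand-in for Muon's polynomial, and all function names are hypothetical.

```python
import numpy as np

def polar_factor(G):
    """Exact polar factor U V^T computed from a thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def aol_precondition(G, eps=1e-12):
    """AOL-style column rescaling (assumed form): d_j = (sum_i |G^T G|_ji)^(-1/2)."""
    d = 1.0 / np.sqrt(np.abs(G.T @ G).sum(axis=1) + eps)
    return G * d

def newton_schulz(X, num_iters):
    """Classic cubic Newton-Schulz step, a stand-in for Muon's polynomial."""
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

def error_terms(G, num_iters):
    """Return (total, approximation, bias) errors, relative to ||U V^T||_F."""
    target = polar_factor(G)            # what an exact orthogonalization gives
    P = aol_precondition(G)
    biased_target = polar_factor(P)     # what NS on P converges to
    X = newton_schulz(P, num_iters)
    scale = np.sqrt(min(G.shape))       # Frobenius norm of a full-rank polar factor
    total = np.linalg.norm(X - target) / scale
    approx = np.linalg.norm(X - biased_target) / scale
    bias = np.linalg.norm(biased_target - target) / scale
    return total, approx, bias

rng = np.random.default_rng(0)
matrices = [rng.standard_normal((64, 32)) for _ in range(100)]

# Mean (total, approximation, bias) over the 100 matrices, per iteration count.
results = {
    k: np.mean([error_terms(G, k) for G in matrices], axis=0)
    for k in (2, 4, 8)
}
for k, (total, approx, bias) in results.items():
    print(f"iters={k}: total={total:.4f} approx={approx:.4f} bias={bias:.4f}")
```

In this reading, the approximation term shrinks as iterations are added while the bias term is fixed by the preconditioner, so the total error is bounded by their sum.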

Theorems & Definitions (4)

  • Proposition 1: Turbo-Muon Steepest Descent
  • proof
  • Lemma 1
  • proof