AuON: A Linear-time Alternative to Orthogonal Momentum Updates

Dipan Maity

TL;DR

AuON is a linear-time optimizer that enforces a spectral-norm constraint on updates via hyperbolic-cosine RMS scaling, avoiding costly full orthogonalization. Its core innovations are a built-in "emergency brake" that dampens spikes from exploding attention logits and a hybrid variant that adds a single Newton-Schulz step for partial decorrelation. Theoretical guarantees (spectral contraction, reduced correlation energy) and experiments across language modeling and vision tasks show stable training, improved generalization, and efficiency close to that of SGD/AdamW. The work offers a practical, scalable alternative to Muon for geometry-aware optimization in large-scale transformers and related architectures.

Abstract

Orthogonal momentum gradient updates have emerged to overcome the limitations of vector-based optimizers like Adam, which suffers from high memory costs and ill-conditioned momentum gradient updates. However, traditional orthogonal momentum approaches, such as SVD/QR decomposition, suffer from high computational and memory costs and underperform compared to well-tuned SGD with momentum. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and approximating orthogonal matrices via Newton-Schulz iterations, which gives better GPU utilization and high active TFLOPS, and reduces memory usage by up to 3x. Nevertheless, vanilla Muon suffers from exploding attention logits and has cubic computational complexity. In this paper, we take a deep dive into orthogonal momentum gradient updates to find the main properties that help Muon achieve remarkable performance. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximating orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. AuON has an automatic "emergency brake" to handle exploding attention logits. We further introduce a hybrid variant, Hybrid-AuON, that applies the linear transformations with Newton-Schulz iterations and outperforms Muon on language modeling tasks. Code is available at: https://github.com/ryyzn9/AuON

Paper Structure

This paper contains 48 sections, 68 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Visualization of the Newton-Schulz process (0.5) over 5 iterations, compared with AuON and Hybrid-AuON (NS=1). The heatmaps (top) show progressive orthogonalization, with $M M^\top$ converging from a scattered structure (Step 0) to an identity-like diagonal (Step 5). The singular value plot (bottom) illustrates rapid convergence toward $1.0$, confirming orthogonalization.
  • Figure 2: Comparison of the computational efficiency of different methods on $n \times n$ random matrices, where Hybrid-AuON (NS=1)
  • Figure 3: The Emergency Brake in Action. The Muon optimizer (red) exhibits step-function explosions, driving logits to hard saturation (greater than 30). In contrast, AuON (blue) and AuON+5 (NS=5) (green) demonstrate the theoretically predicted logarithmic braking curve, capping logit growth at approximately 12 even after 1,000 steps of unnormalized accumulation.
  • Figure 4: Thermodynamics of Optimization (20k Steps). The Muon optimizer (Red) exhibits high-variance chaotic behavior with a persistent upward drift, indicative of unregulated energy injection. In stark contrast, AuON (Blue) demonstrates a dissipative trajectory. After an initial rise due to the lack of normalization, the self-regulating cosh mechanism forces an inflection point at approximately step 2,500, after which the logit magnitudes steadily decay. This proves that AuON actively removes instability energy from the system over time.
  • Figure 5: Training Monitor: Correlation between loss instability (red) and gradient kurtosis (blue). Temporal correlation of instability: the training trajectory reveals that significant spikes in training loss (instability events, e.g., at steps 24, 40, and 46) are inextricably linked to simultaneous or preceding spikes in gradient kurtosis. This establishes high kurtosis as a leading indicator of divergence in real-world optimization.
  • ...and 7 more figures
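
The Newton-Schulz orthogonalization referred to in the abstract and visualized in Figure 1 can be illustrated with a minimal NumPy sketch. It is not the paper's or Muon's implementation: it assumes the textbook cubic iteration $X \leftarrow 1.5X - 0.5\,XX^\top X$ (Muon uses a tuned higher-order polynomial), and the function name `newton_schulz_orthogonalize` and the test matrix are hypothetical choices made for illustration only.

```python
import numpy as np


def newton_schulz_orthogonalize(m: np.ndarray, steps: int = 5) -> np.ndarray:
    """Push the singular values of `m` toward 1 without computing an SVD.

    Assumption: textbook cubic Newton-Schulz iteration, not Muon's tuned variant.
    """
    # Frobenius normalization puts every singular value in (0, 1],
    # which lies inside the convergence region (0, sqrt(3)) of the iteration.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Each step maps a singular value s to 1.5*s - 0.5*s**3;
        # the singular vectors are left unchanged.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


# Hypothetical test matrix with known singular values 3, 2, 1.5, 1.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
m = u @ np.diag([3.0, 2.0, 1.5, 1.0]) @ v.T

# Singular values after 0, 1, 3 and 5 iterations.
for steps in (0, 1, 3, 5):
    x = newton_schulz_orthogonalize(m, steps)
    print(steps, np.round(np.linalg.svd(x, compute_uv=False), 4))
```

Each iteration costs matrix-matrix products, which is the source of the cubic complexity the abstract attributes to Muon; the SVD call here is used only to inspect the result.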