AuON: A Linear-time Alternative to Orthogonal Momentum Updates
Dipan Maity
TL;DR
AuON is a linear-time optimizer that enforces a spectral-norm constraint on updates through hyperbolic-cosine RMS scaling, avoiding costly full orthogonalization. Its core contributions are a built-in "emergency brake" that dampens spikes from exploding attention logits and a hybrid variant that adds a single Newton-Schulz step for partial decorrelation. Theoretical guarantees (spectral contraction, reduced correlation energy) and extensive experiments across language modeling and vision tasks show stable training, improved generalization, and efficiency close to that of SGD and AdamW. This work offers a practical, scalable alternative to Muon for geometry-aware optimization in large-scale transformers and related architectures.
Abstract
Orthogonal momentum gradient updates have emerged to overcome the limitations of vector-based optimizers such as Adam, which suffers from high memory costs and ill-conditioned momentum gradient updates. However, traditional orthogonal momentum approaches based on SVD or QR decomposition incur high computational and memory costs and underperform well-tuned SGD with momentum. Recent advances such as Muon improve efficiency by applying momentum before orthogonalization and approximating orthogonal matrices via Newton-Schulz iterations, which yields better GPU utilization, high TFLOPS throughput, and a reduction in memory usage of up to 3x. Nevertheless, vanilla Muon suffers from exploding attention logits and has cubic computational complexity. In this paper, we take a close look at orthogonal momentum gradient updates to identify the main properties that help Muon achieve its remarkable performance. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximating orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. AuON has an automatic "emergency brake" to handle exploding attention logits. We further introduce a hybrid variant, Hybrid-AuON, that applies the linear transformations together with Newton-Schulz iterations and outperforms Muon on language modeling tasks. Code is available at: https://github.com/ryyzn9/AuON
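For readers unfamiliar with Newton-Schulz orthogonalization, the following is a minimal sketch of the textbook cubic iteration X <- 1.5 X - 0.5 X X^T X. It is not Muon's tuned variant and not the AuON update; the function names and the 2x2 example matrix are hypothetical and chosen only for illustration. It is written in dependency-free Python, whereas a practical optimizer would use a tensor library. Each step needs matrix-matrix products, which is the source of the cubic cost noted above.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of m (hypothetical helper).

    Uses the classic cubic Newton-Schulz iteration, which is an assumption
    made for this sketch rather than the scheme used by Muon or AuON.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # inside the iteration's convergence region (0, sqrt(3)).
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        # The matrix products below dominate the cost of each step.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxt_x)
        ]
    return x


# Hypothetical 2x2 momentum matrix, for illustration only.
momentum = [[3.0, 1.0], [0.0, 2.0]]
q = newton_schulz_orthogonalize(momentum)
gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
```

Checking that `gram` is close to the identity matrix is a quick way to confirm that the iteration has converged for a given input.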
