ARO: A New Lens On Matrix Optimization For Large Models
Wenbo Gong, Javier Zazo, Qijun Luo, Puqian Wang, James Hensman, Chao Ma
TL;DR
The paper introduces Adaptively Rotated Optimization (ARO), a matrix-optimization framework that treats gradient rotation as a fundamental design principle to surpass orthogonalization-based methods in large-model pretraining. By rotating gradients in conjunction with a base projection f_t and adapting the rotation via momentum-informed geometry, ARO achieves consistent speedups over AdamW and Muon across model families (dense and MoE) up to 8B parameters, with controlled benchmarking and no clear diminishing returns. The authors connect ARO to a symmetry-teleportation perspective, showing that gradient rotations align with rotational symmetries of residual streams in transformers, and offer practical extensions such as full-model rotation, cross-layer coupling, and scalable rotation estimation via shifted Cholesky QR. Empirically, ARO-Sinkhorn emerges as the strongest variant, delivering up to ~1.3x speedup on GPT2-XL and Sigma-MoE regimes, while maintaining comparable throughput to baseline optimizers. The work argues for a symmetry-driven view of matrix optimization, providing design principles and preliminary validations that rotations and architecture-induced symmetries can jointly drive more efficient, robust training of very large language models.
Abstract
Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening-based methods. While these methods yield substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present \textbf{Adaptively Rotated Optimization (ARO)}, a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by 1.3$\sim$1.35$\times$) and orthogonalization methods (by 1.1$\sim$1.15$\times$) in LLM pretraining at up to 8B activated parameters, and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.
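As a rough illustration of what "normed steepest descent in a rotated coordinate system" can look like, the NumPy sketch below rotates a gradient, takes a steepest-descent step there, and rotates the step back. It rests on two assumptions that do not come from the paper: the base rule is the elementwise sign (the steepest-descent direction under the elementwise max norm), and the rotation is the orthogonal factor of a QR decomposition. The function names `rotated_sign_step` and `rotation_from_qr` are hypothetical, and the rotation choice here is a stand-in, not ARO's norm-informed policy.

```python
import numpy as np

def rotated_sign_step(W, G, R, lr=0.1):
    """One sign-descent step taken in the coordinates defined by R.

    W, G: (m, n) weight and gradient matrices.
    R:    (m, m) orthogonal matrix (the rotation).
    """
    G_rot = R.T @ G              # express the gradient in rotated coordinates
    D_rot = np.sign(G_rot)       # steepest-descent direction under the max norm
    return W - lr * (R @ D_rot)  # rotate the step back and apply it

def rotation_from_qr(M):
    """Illustrative rotation only (an assumption, not ARO's policy):
    the full orthogonal factor of a QR decomposition of M."""
    Q, _ = np.linalg.qr(M, mode="complete")
    return Q

# Example usage on random data.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))
R = rotation_from_qr(G)          # here G stands in for a momentum buffer
W_new = rotated_sign_step(W, G, R, lr=0.1)
```

With `R` set to the identity this reduces to plain sign descent. For any orthogonal `R`, the step has a non-negative inner product with the gradient, because that inner product equals the sum of absolute values of the entries of `R.T @ G`.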
