ARO: A New Lens On Matrix Optimization For Large Models

Wenbo Gong, Javier Zazo, Qijun Luo, Puqian Wang, James Hensman, Chao Ma

TL;DR

The paper introduces Adaptively Rotated Optimization (ARO), a matrix-optimization framework that treats gradient rotation as a fundamental design principle to surpass orthogonalization-based methods in large-model pretraining. By rotating gradients in conjunction with a base projection f_t and adapting the rotation via momentum-informed geometry, ARO achieves consistent speedups over AdamW and Muon across model families (dense and MoE) up to 8B parameters, with controlled benchmarking and no clear diminishing returns. The authors connect ARO to a symmetry-teleportation perspective, showing that gradient rotations align with rotational symmetries of residual streams in transformers, and offer practical extensions such as full-model rotation, cross-layer coupling, and scalable rotation estimation via shifted Cholesky QR. Empirically, ARO-Sinkhorn emerges as the strongest variant, delivering up to ~1.3x speedup in the GPT2-XL and Sigma-MoE regimes while maintaining throughput comparable to baseline optimizers. The work argues for a symmetry-driven view of matrix optimization, providing design principles and preliminary validation that rotations and architecture-induced symmetries can jointly drive more efficient, robust training of very large language models.

Abstract

Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening-based methods. While these methods yield substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present Adaptively Rotated Optimization (ARO), a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by $1.3\sim1.35\times$) and orthogonalization methods (by $1.1\sim1.15\times$) in LLM pretraining at up to 8B activated parameters, and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.

Paper Structure (168 sections, 4 theorems, 285 equations, 26 figures, 5 tables, 3 algorithms)

Key Result

Theorem C.1

Assume the parameter update takes the form $\Delta {\bm{W}} \propto {\bm{R}} f({\bm{R}}^T{\bm{M}})$ (the ARO update), and define [...] and [...]. Then $\Delta {\bm{W}}$ is guaranteed to misalign with $\bar{{\bm{G}}}$ when $\cos(\theta_{GM})<-\frac{1}{\sqrt{1+k^2}}$ and momentum misaligns with $\bar{{\bm{G}}}$.
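
The update form $\Delta {\bm{W}} \propto {\bm{R}} f({\bm{R}}^T{\bm{M}})$ in the theorem can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the helper names (`rotated_update`, `random_rotation`, `frobenius_normalize`) are hypothetical, the rotation is a random orthogonal matrix rather than the paper's norm-informed rotation policy, and the two base maps are simple stand-ins for $f$. The sketch only shows the mechanics of rotating, applying a base map, and rotating back, together with the fact (in the spirit of Remark 5) that a rotation-equivariant base map makes the rotation cancel.

```python
import numpy as np

def rotated_update(M, R, f):
    """Rotated update direction R f(R^T M) for a momentum matrix M and orthogonal R."""
    return R @ f(R.T @ M)

def random_rotation(m, rng):
    """Placeholder orthogonal matrix taken from the QR factorization of Gaussian noise."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return Q

def frobenius_normalize(X):
    """A rotation-equivariant base map: f(R^T X) == R^T f(X) for any orthogonal R."""
    return X / np.linalg.norm(X)  # default matrix norm is Frobenius

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))   # stand-in momentum matrix (hypothetical shape)
R = random_rotation(4, rng)       # stand-in rotation, NOT the ARO rotation policy

# Elementwise sign is not rotation-equivariant, so the result depends on R.
delta_sign = rotated_update(M, R, np.sign)

# Frobenius normalization is rotation-equivariant, so R cancels: R f(R^T M) == f(M).
delta_frob = rotated_update(M, R, frobenius_normalize)
rotation_cancels = np.allclose(delta_frob, frobenius_normalize(M))
print("rotation cancels for an equivariant base map:", rotation_cancels)
```

The choice of rotation therefore only matters for base maps that are not rotation-equivariant, such as elementwise sign; this sketch makes no claim about which rotation is best.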

Figures (26)

  • Figure 1: Scaling results preview of ARO under a rigorous, controlled benchmarking protocol. (a) ARO delivers consistent, non-diminishing speedups over AdamW and Muon across model scales (up to 8B activated parameters) and training budgets (up to $8\times$ overtrain, denoted by $\texttt{OT}$). (b) In the Sigma-MoE-2B \cite{hu2025sigmamoetinytechnicalreport} example, ARO consistently outperforms AdamW, orthogonalization-based methods (Muon, Dion), as well as conventional eigenvector-rotated optimizers (Eigen) across overtraining budgets. Moreover, ARO is applicable to all matrix parameters (ARO full-model mode, including embeddings and LM heads), yielding better long-horizon performance than ARO applied to hidden layers only.
  • Figure 2: Properties of ARO rotations. (a)-(b) ARO rotations give a higher rotation objective value. During 130M nanoGPT training with ARO using SinkGD \cite{scetbon2025gradient} or SignGD \cite{bernstein2018signsgd} as base optimizers, we evaluate at each step $t$ and for each momentum matrix ${\bm{M}}_t$ the objective $\mathcal{J}({\bm{R}};{\bm{M}}_t,f)$ for the ARO rotation ${\bm{R}}_t^{\text{ARO}}$ (\ref{eq: gf_r}) and the exact eigen-rotation baseline ${\bm{R}}_t^{\text{eig}}$ (\ref{eq: eig_rotation}). (a) shows the mean relative gain $(\mathcal{J}({\bm{R}}_t^{\text{ARO}})-\mathcal{J}({\bm{R}}_t^{\text{eig}}))/|\mathcal{J}({\bm{R}}_t^{\text{eig}})|$ (percent) over all parameters, and (b) shows the fraction of parameters with $\mathcal{J}({\bm{R}}_t^{\text{ARO}})>\mathcal{J}({\bm{R}}_t^{\text{eig}})$. ARO scores higher for most parameters throughout training. (c)-(d) ARO updates are non-orthonormal. We plot the singular value spectrum (normalized to have maximum 1) of the update matrix on a random Gaussian gradient. (c) Both ARO-Sinkhorn and ARO-Sign updates deviate from orthogonalization. (d) Using ${\bm{R}}_t^{\text{eig}}$ instead with the SinkGD/SignGD base optimizers yields rotated updates closer to orthonormal, with Eigen-Sink exactly orthonormal. Exact definitions of baselines can be found in \ref{sec: baselines}.
  • Figure 3: Comparing ARO rotation with eigen-rotations across various base optimizers. Results indicate that the non-eigen-rotations of ARO consistently outperform eigen-rotations and improve all base optimizers to a competitive level. Experiments were done on the GPT2-124M model. By default, we use the SCQR implementation across all ARO and eigen-rotation methods. Ablations on the QR implementation can be found in \ref{fig: m2.1}.
  • Figure 4: The impact of the choice of rotation policies and base optimizers on training loss. Rotations have a significant impact on performance.
  • Figure 5: The performance of different rotation policies under both SCQR (left) and QR (right) implementations, across different base optimizers. The RowNorm family is omitted because ARO coincides with eigen-rotation in that case, as shown in \ref{sec: ortho_as_special_case}. The results suggest that ARO not only provides a better rotation direction, but also enables fast QR computation by improving the conditioning.
  • ...and 21 more figures
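
The spectrum diagnostic described in Figure 2(c)-(d), and the remark in Figure 5 that the RowNorm family coincides with eigen-rotation, can be reproduced in miniature. The sketch below is a hedged illustration, not the paper's code: it assumes the eigen-rotation is the matrix of left singular vectors of the gradient (equivalently, eigenvectors of ${\bm{G}}{\bm{G}}^T$), uses plain row normalization and elementwise sign as stand-in base maps (not SinkGD itself), and all function names are hypothetical. Under these assumptions, row normalization in the eigen-rotated frame returns the orthogonalized gradient ${\bm{U}}{\bm{V}}^T$, so its normalized spectrum is flat, while the sign-based update is not orthonormal.

```python
import numpy as np

def rotated_update(G, R, f):
    """Rotated update R f(R^T G)."""
    return R @ f(R.T @ G)

def row_normalize(X):
    """Scale each row to unit Euclidean norm (stand-in for a RowNorm-style base map)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def normalized_spectrum(X):
    """Singular values of X, scaled so the largest equals 1."""
    s = np.linalg.svd(X, compute_uv=False)
    return s / s.max()

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))                   # random Gaussian "gradient"
U, S, Vt = np.linalg.svd(G, full_matrices=False)  # G = U diag(S) Vt
R_eig = U  # ASSUMPTION: eigen-rotation taken as the left singular vectors of G

# In the rotated frame, R_eig^T G = diag(S) Vt has orthogonal rows with norms S,
# so row normalization yields Vt and the update equals the orthogonal factor U Vt.
eig_rownorm = rotated_update(G, R_eig, row_normalize)
eig_sign = rotated_update(G, R_eig, np.sign)

spec_rownorm = normalized_spectrum(eig_rownorm)
spec_sign = normalized_spectrum(eig_sign)
print("RowNorm + eigen-rotation spectrum:", np.round(spec_rownorm, 3))
print("Sign + eigen-rotation spectrum:   ", np.round(spec_sign, 3))
```

A flat normalized spectrum (all ones) indicates an orthonormal update; any spread below 1 indicates a departure from orthogonalization, which is the quantity the figure visualizes for the ARO variants.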

Theorems & Definitions (35)

  • Remark 1: momentum-first design of normed steepest descent
  • Remark 2: Stateless and stateful projection functions
  • Remark 3: momentum-first design of rotations
  • Remark 4: Discrete case
  • Remark 5: ARO has no effect on rotational-equivariant $f_t$
  • Remark 6: Understanding gradient orthogonalization
  • Remark 7: Imperfect update of internal states
  • Remark 8: Overlapping communication and computation
  • Remark 9: Practical throughput
  • Remark 10
  • ...and 25 more