
Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

Jinghui Yuan, Jiaxuan Zou, Shuo Wang, Yong Liu, Feiping Nie

Abstract

Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs); however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs, as in Muon, or permitting radial jitter that compromises stability, as in RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.
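To make the projection step concrete, the following is a minimal PyTorch sketch of what a Nora-style row-wise update could look like. Only the row-wise projection of the momentum onto the orthogonal complement of the weight rows and the $\mathcal{O}(mn)$ cost follow from the abstract; the function name nora_step_sketch, the exponential-momentum form, the row normalization of the update, and all hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch

def nora_step_sketch(w, m, grad, lr=1e-3, beta=0.95, eps=1e-8):
    """Hedged sketch of a Nora-style update (not the authors' code).

    w, m, grad are (rows x cols) tensors of identical shape; m is a
    momentum buffer updated in place, assumed to be a standard
    heavy-ball accumulator of gradients.
    """
    # Momentum accumulation (assumed standard heavy-ball form).
    m.mul_(beta).add_(grad)

    # Row-wise projection of the momentum onto the orthogonal complement
    # of the weights: for each row i, remove the component of m_i along w_i.
    coeff = (m * w).sum(dim=1, keepdim=True) / (w * w).sum(dim=1, keepdim=True).clamp_min(eps)
    update = m - coeff * w

    # Row-wise normalization (assumed) so every row takes a comparable angular step.
    update = update / update.norm(dim=1, keepdim=True).clamp_min(eps)

    # Apply the update; every operation above is an elementwise op or a
    # row-wise reduction, so the per-step cost stays O(mn).
    w.add_(update, alpha=-lr)
    return w, m
```

Because the sketch uses only elementwise operations and row-wise reductions over the $m \times n$ weight matrix, its cost per step is linear in the number of parameters, consistent with the complexity stated above.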

Paper Structure

This paper contains 34 sections, 13 theorems, 124 equations, 2 figures, 10 tables, and 1 algorithm.

Key Result

Theorem 4.1

Consider a neural network layer defined by $h = wx$, where the weights $w \in \mathbb{R}^{m \times n}$. Assume the input activation $x \in \mathbb{R}^n$ satisfies the scaling hypothesis under standard deep learning initialization: $\|x\|_2 \le \gamma \sqrt{n}$, where $\gamma = \Theta(1)$ is a constant ...
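As a quick sanity check of the scaling hypothesis $\|x\|_2 \le \gamma \sqrt{n}$ with $\gamma = \Theta(1)$, the snippet below draws activations with i.i.d. standard-normal entries (an assumed stand-in for "standard deep learning initialization") and verifies that the ratio $\|x\|_2 / \sqrt{n}$ stays close to 1 as $n$ grows.

```python
import torch

# Empirical check: for x with i.i.d. N(0, 1) entries, ||x||_2 concentrates
# around sqrt(n), so ||x||_2 <= gamma * sqrt(n) holds with gamma = Theta(1).
for n in (256, 1024, 4096):
    x = torch.randn(n)
    print(n, (x.norm() / n ** 0.5).item())  # ratio stays near 1 as n grows
```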

Figures (2)

  • Figure 1: Training dynamics on the 135M model. Left: loss over training steps. Right: perplexity over training steps. Nora continues to improve late in training and finishes with the lowest loss and perplexity.
  • Figure 2: Training dynamics on the 135M model. This figure illustrates the perplexity and loss decay curves for Nora (default $\text{weight decay}=0$) in comparison with Mano (under $\text{weight decay}=0$ and $\text{weight decay}=0.1$).

Theorems & Definitions (17)

  • Theorem 4.1: Nora Scaling under the Scaling Hypothesis
  • Theorem 4.2: Asymptotic Convergence
  • Assumption 4.3: Smoothness
  • Theorem 4.4: Nora under matched $(\infty,2)$-smoothness
  • Proposition 4.5: Frobenius-smooth counterparts
  • Corollary 4.6: Standard first-order stationarity under row-wise scale invariance
  • Lemma A.1: Row-wise projection is non-expansive
  • Lemma A.2: Basic geometry of Nora
  • Lemma A.3: Descent under Frobenius smoothness
  • Lemma A.4: Descent under matched $(\infty,2)$-smoothness
  • ...and 7 more