On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

Ruihan Xu, Jiajin Li, Yiping Lu

TL;DR

This work proposes MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Experiments show that MOGA is competitive with Muon while being notably faster in large-token and low-loss regimes.

Abstract

A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $(p,\textrm{mean}) \to (q,\textrm{mean})$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $\mu$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.

Paper Structure

This paper contains 28 sections, 8 theorems, 105 equations, 9 figures, 1 algorithm.

Key Result

Proposition 1

Consider the steepest-descent subproblem (eq:steep_descent) with gradient $\bm{G}=\nabla f(\bm{\Theta})\in\mathbb{R}^{m\times n}$. For the operator norms $\|\cdot\|_{1\to q}$ and $\|\cdot\|_{p\to\infty}$ with $p,q\ge 1$, the corresponding steepest-descent directions admit the following closed-form expressions: … Here $\odot$ denotes elementwise multiplication.
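
The closed forms themselves are not reproduced in this summary. As a rough illustration only, the NumPy sketch below computes unit-norm steepest-descent directions for the two norm families, relying on the standard identities $\|\bm{A}\|_{1\to q}=\max_j \|\bm{A}_{:,j}\|_q$ and $\|\bm{A}\|_{p\to\infty}=\max_i \|\bm{A}_{i,:}\|_{p^*}$ with $1/p+1/p^*=1$. The function names, the unit-norm convention (no rescaling by the dual norm of the gradient), and the treatment of zero rows or columns are assumptions made for this example and may differ from the paper's statement.

```python
import numpy as np

def dual_exponent(p):
    """Hoelder conjugate p* with 1/p + 1/p* = 1 (handles p = 1 and p = inf)."""
    if p == 1:
        return np.inf
    if np.isinf(p):
        return 1.0
    return p / (p - 1.0)

def lp_steepest_vector(g, r):
    """Vector d with ||d||_r = 1 maximizing <g, d>; the maximum is ||g||_{r*}."""
    g = np.asarray(g, dtype=float)
    if not np.any(g):
        return np.zeros_like(g)          # assumption: zero input gives zero direction
    if np.isinf(r):                      # l_inf ball: elementwise sign
        return np.sign(g)
    if r == 1:                           # l_1 ball: all mass on the largest |g_i|
        d = np.zeros_like(g)
        k = np.argmax(np.abs(g))
        d[k] = np.sign(g[k])
        return d
    s = dual_exponent(r)
    a = np.abs(g) / np.max(np.abs(g))    # rescale before powering, for stability
    d = np.sign(g) * a ** (s - 1.0)      # sign(g) ⊙ |g|^(r* - 1), up to scale
    return d / np.linalg.norm(d, ord=r)

def steepest_descent_1_to_q(G, q):
    """Descent direction under ||.||_{1->q}: each column handled independently."""
    cols = [lp_steepest_vector(G[:, j], q) for j in range(G.shape[1])]
    return -np.stack(cols, axis=1)

def steepest_descent_p_to_inf(G, p):
    """Descent direction under ||.||_{p->inf}: each row handled independently."""
    r = dual_exponent(p)                 # rows are constrained in the l_{p*} norm
    rows = [lp_steepest_vector(G[i, :], r) for i in range(G.shape[0])]
    return -np.stack(rows, axis=0)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
col_normalized = steepest_descent_1_to_q(G, 2)      # column normalization
row_normalized = steepest_descent_p_to_inf(G, 2)    # row normalization
sign_update = steepest_descent_1_to_q(G, np.inf)    # sign update (SignGD-like)
```

Under these assumptions, $q=2$ reduces to column normalization, $p=2$ to row normalization, and the $1\to\infty$ case to an elementwise sign update, matching the optimizer families named in the abstract.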

Figures (9)

  • Figure 1: Operators Should Play Nice Together. Chaining layer-wise stability bounds under $\|\cdot\|_{p \to q}$ requires $\|\cdot\|_p \le \|\cdot\|_q$. This fails for classical $p\to q$ norms when $p\leq q$ but holds for $(p,\textrm{mean}) \to (q,\textrm{mean})$ norms, yielding dimension-independent bounds.
  • Figure 2: Embedding layer geometry. One-hot inputs place embedding training in the $1 \to {(q,\textrm{mean})}$ geometry, where both column normalization and SignGD are effective.
  • Figure 3: MOGA (p=1.5)
  • Figure 4: MOGA (p=2)
  • Figure 5: MOGA (p=3)
  • ...and 4 more figures
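
The composability condition in the Figure 1 caption can be checked numerically. The sketch below assumes the mean-normalized vector norm is $\|x\|_{(p,\textrm{mean})}=\big(\tfrac{1}{n}\sum_i |x_i|^p\big)^{1/p}$; this definition is not stated in the summary and is an assumption here, as are the helper names. On the all-ones vector, the classical $\ell_1/\ell_2$ ratio grows like $\sqrt{n}$ with the width $n$, while the mean-normalized ratio stays at 1, in line with the power-mean inequality $\|x\|_{(p,\textrm{mean})}\le\|x\|_{(q,\textrm{mean})}$ for $p\le q$.

```python
import numpy as np

def lp_norm(x, p):
    """Classical l_p norm."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def lp_mean_norm(x, p):
    """Assumed mean-normalized l_p norm: (mean_i |x_i|^p)^(1/p)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x) ** p) ** (1.0 / p)

for n in (16, 256, 4096):
    ones = np.ones(n)
    classical_gap = lp_norm(ones, 1) / lp_norm(ones, 2)        # equals sqrt(n)
    mean_gap = lp_mean_norm(ones, 1) / lp_mean_norm(ones, 2)   # equals 1 for every n
    print(n, classical_gap, mean_gap)
```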

Theorems & Definitions (26)

  • Definition 1: Matrix Operator Norm
  • Definition 2: Feedforward Neural Network
  • Remark 1
  • Proposition 1
  • proof
  • proof
  • Theorem 1: Width-independent Lipschitz bound under mean-normalized geometry
  • Remark 2: Why the bounded parameter set is natural
  • proof
  • Definition 3: $L$-Smoothness
  • ...and 16 more