On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

Ruihan Xu; Jiajin Li; Yiping Lu

TL;DR

This work proposes MOGA (Matrix Operator Geometry Aware), a width-aware optimizer that uses only row/column-wise normalization and enables stable learning-rate transfer across model widths. In experiments, MOGA is competitive with Muon while being notably faster in large-token and low-loss regimes.

Abstract

A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $(p,\textrm{mean}) \to (q,\textrm{mean})$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $\mu$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Building on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with \textrm{Muon} while being notably faster in large-token and low-loss regimes.
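As background for the steepest-descent viewpoint in the abstract, the block below sketches the standard textbook objects involved. The notation ($\bm{A}$, $\bm{G}$, $\bm{\Delta}$, step size $\eta$, dual norm $\|\cdot\|_{*}$) is assumed here for illustration and may differ from the paper's own Definition 1 and steepest-descent subproblem. The two special cases are well-known facts about induced norms, not results taken from the paper.

```latex
% Induced (operator) norm of a matrix A mapping (R^n, ||.||_p) to (R^m, ||.||_q)
\|\bm{A}\|_{p\to q} \;=\; \sup_{\bm{x}\neq 0}\frac{\|\bm{A}\bm{x}\|_q}{\|\bm{x}\|_p}

% Steepest descent under a norm ||.|| with step size eta (assumed form)
\bm{\Delta}^\star \;=\; \arg\min_{\bm{\Delta}}\;
  \langle \bm{G},\bm{\Delta}\rangle + \frac{1}{2\eta}\|\bm{\Delta}\|^2
  \;=\; -\,\eta\,\|\bm{G}\|_{*}\,\bm{D}^\star,
\qquad
\bm{D}^\star \in \arg\max_{\|\bm{D}\|\le 1}\langle \bm{G},\bm{D}\rangle

% Two familiar special cases
\|\bm{A}\|_{1\to\infty} = \max_{i,j}|A_{ij}|
  \;\Rightarrow\; \bm{D}^\star = \operatorname{sign}(\bm{G})
  \quad\text{(sign descent)}
\|\bm{A}\|_{2\to 2} = \sigma_{\max}(\bm{A})
  \;\Rightarrow\; \bm{D}^\star = \bm{U}\bm{V}^\top
  \text{ for } \bm{G}=\bm{U}\bm{\Sigma}\bm{V}^\top
  \quad\text{(orthogonalized update)}
```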

Paper Structure (28 sections, 8 theorems, 105 equations, 9 figures, 1 algorithm)

Introduction
Notation.
Matrix Thinking: Unifying Optimizers via Matrix Operator Norm
Computability and per-iteration cost.
Convergence speed.
From AdamW, Muon to Row and Column Normalization
Width-independent Lipschitz Bound under Mean-Normalized Geometry
Width-independent Smoothness Bound under Mean-Normalized Geometry
MOGA Optimizer
Generalization to Transformer
Input Word Embedding and Positional Embeddings
Biases and LayerNorm Parameters
Self-Attention
MLP
Word Unembedding
...and 13 more sections

Key Result

Proposition 1

Consider the steepest-descent subproblem \eqref{eq:steep_descent} with gradient $\bm{G}=\nabla f(\bm{\Theta})\in\mathbb{R}^{m\times n}$. For the operator norms $\|\cdot\|_{1\to q}$ and $\|\cdot\|_{p\to\infty}$ with $p,q\ge 1$, the corresponding steepest-descent directions admit closed-form expressions, in which $\odot$ denotes elementwise multiplication.
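To make the link between these norms and row/column normalization concrete, here is a minimal NumPy sketch for the Euclidean special cases only. It assumes the convention $\bm{y}=\bm{W}\bm{x}$ with $\bm{W}\in\mathbb{R}^{m\times n}$, and uses the standard identities $\|\bm{A}\|_{1\to 2}=\max_j\|\bm{A}_{:,j}\|_2$ and $\|\bm{A}\|_{2\to\infty}=\max_i\|\bm{A}_{i,:}\|_2$. The function names and the `eps` guard are hypothetical, and the code does not reproduce the proposition's general formulas for arbitrary $p$ and $q$.

```python
import numpy as np

def column_normalize(G, eps=1e-12):
    """Unit-ball maximizer of <G, D> under ||D||_{1->2} (max column 2-norm)."""
    col_norms = np.linalg.norm(G, axis=0, keepdims=True)
    return G / (col_norms + eps)  # eps is an illustrative guard against zero columns

def row_normalize(G, eps=1e-12):
    """Unit-ball maximizer of <G, D> under ||D||_{2->inf} (max row 2-norm)."""
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    return G / (row_norms + eps)

def sign_direction(G):
    """Unit-ball maximizer of <G, D> under ||D||_{1->inf} (max absolute entry)."""
    return np.sign(G)

def norm_1_to_2(A):
    """||A||_{1->2}: largest column 2-norm."""
    return np.linalg.norm(A, axis=0).max()

def norm_2_to_inf(A):
    """||A||_{2->inf}: largest row 2-norm."""
    return np.linalg.norm(A, axis=1).max()

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))   # stand-in gradient with m=4, n=6

D_col = column_normalize(G)
D_row = row_normalize(G)
D_sign = sign_direction(G)

# Each direction sits on the unit sphere of its own norm, and its inner
# product with G equals the corresponding dual norm of G.
print("||D_col||_{1->2}   =", norm_1_to_2(D_col))
print("||D_row||_{2->inf} =", norm_2_to_inf(D_row))
print("<G, D_col> vs sum of column norms:",
      np.sum(G * D_col), np.linalg.norm(G, axis=0).sum())
```

A descent step would then subtract a scaled copy of one of these directions from the weights. How that scale should depend on width is the subject of the paper's scaling rules and is not modeled in this sketch.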

Figures (9)

Figure 1: Operators Should Play Nice Together. Chaining layer-wise stability bounds under $\|\cdot\|_{p \to q}$ requires $\|\cdot\|_p \le \|\cdot\|_q$. This fails for classical $p\to q$ norms when $p\leq q$ but holds for $(p,\textrm{mean}) \to (q,\textrm{mean})$ norms, yielding dimension-independent bounds.
Figure 2: Embedding layer geometry. One-hot inputs place embedding training in the $1 \to {(q,\textrm{mean})}$ geometry, where both column normalization and SignGD are effective.
Figure 3: MOGA (p=1.5)
Figure 4: MOGA (p=2)
Figure 5: MOGA (p=3)
...and 4 more figures
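The composability condition in the Figure 1 caption can be checked numerically. The sketch below assumes that the mean-normalized vector norm is $\|\bm{x}\|_{(p,\textrm{mean})}=\big(\tfrac{1}{n}\sum_i |x_i|^p\big)^{1/p}$, which is an assumption about the paper's definition. The helper names are hypothetical. With $p<q$, the classical ratio $\|\bm{x}\|_p/\|\bm{x}\|_q$ can grow with the dimension $n$, while the mean-normalized ratio stays at most 1 by the power-mean inequality.

```python
import numpy as np

def lp_norm(x, p):
    """Classical l_p norm."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def lp_mean_norm(x, p):
    """Mean-normalized l_p norm (assumed definition: average instead of sum)."""
    return np.mean(np.abs(x) ** p) ** (1.0 / p)

p, q = 2.0, 4.0            # a case with p < q
widths = (16, 256, 4096)

classical_ratios = []
mean_ratios = []
for n in widths:
    x = np.ones(n)         # all-ones vector: worst case for the classical ratio
    classical_ratios.append(lp_norm(x, p) / lp_norm(x, q))
    mean_ratios.append(lp_mean_norm(x, p) / lp_mean_norm(x, q))

# The classical ratio scales like n**(1/p - 1/q); the mean-normalized ratio does not.
for n, c, m in zip(widths, classical_ratios, mean_ratios):
    print(f"n={n:5d}  classical ratio={c:.3f}  mean-normalized ratio={m:.3f}")

# Random vectors: the mean-normalized ratio never exceeds 1 when p <= q.
rng = np.random.default_rng(0)
random_mean_ratios = [
    lp_mean_norm(v, p) / lp_mean_norm(v, q)
    for v in (rng.standard_normal(n) for n in widths)
]
print("random mean-normalized ratios:", random_mean_ratios)
```

The all-ones input is chosen because it attains the dimension-dependent constant $n^{1/p-1/q}$ for the classical norms, so it makes the width dependence visible with very little code.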

Theorems & Definitions (26)

Definition 1: Matrix Operator Norm
Definition 2: Feedforward Neural Network
Remark 1
Proposition 1
proof
proof
Theorem 1: Width-independent Lipschitz bound under mean-normalized geometry
Remark 2: Why the bounded parameter set is natural
proof
Definition 3: $L$-Smoothness
...and 16 more
