RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Shenyang Deng; Zhuoli Ouyang; Tianyu Pang; Zihang Liu; Ruochen Jin; Shuhua Yu; Yaoqing Yang

Abstract

Preconditioned adaptive methods have gained significant attention for training deep neural networks because they capture rich curvature information about the loss landscape. The central challenge in this field is balancing the effectiveness of the preconditioner against the computational cost of implementing it. Among recent advances, \textsc{Muon} stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite these advantages, the efficiency of \textsc{Muon} leaves room for improvement. In this paper, we introduce \textsc{RMNP} (Row-Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for \textsc{RMNP} in the non-convex setting that match recent results for \textsc{Muon} optimizers and achieve the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that \textsc{RMNP} delivers optimization performance competitive with \textsc{Muon} while substantially reducing preconditioning wall-clock time. Our code is available at \href{https://anonymous.4open.science/r/RMNP-E8E1/}{this link}.
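To make the row-wise $\ell_2$ normalization described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the update direction is the momentum matrix with each row divided by its $\ell_2$ norm. The function names `row_normalize` and `rmnp_like_step`, the momentum rule `beta * m + g`, the stabilizer `eps`, and the default values of `lr` and `beta` are all hypothetical choices made for illustration.

```python
import math


def row_normalize(matrix, eps=1e-8):
    """Divide each row of `matrix` (a list of lists) by its l2 norm.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out


def rmnp_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical update: accumulate momentum, then normalize its rows.

    The momentum rule and the hyperparameter defaults are assumptions,
    not values taken from the paper.
    """
    new_momentum = [
        [beta * m + g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum, grad)
    ]
    direction = row_normalize(new_momentum)
    new_weight = [
        [w - lr * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, direction)
    ]
    return new_weight, new_momentum


# Toy 2 x 2 example with zero initial momentum.
W = [[1.0, 2.0], [3.0, 4.0]]
G = [[3.0, 4.0], [0.0, 2.0]]
M = [[0.0, 0.0], [0.0, 0.0]]
W_next, M_next = rmnp_like_step(W, G, M)
```

Each entry of the $m\times n$ matrix is visited a constant number of times, which is where the $\mathcal{O}(mn)$ cost comes from; no matrix-matrix product is needed, in contrast to Newton-Schulz iteration.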

Paper Structure (43 sections, 16 theorems, 129 equations, 14 figures, 13 tables)

Introduction
Related Work
Preconditioned Optimization Algorithms
Hessian Properties of Neural Networks
Convergence Analysis of Adaptive Algorithms
Method
RMNP Preconditioner
Analysis of Muon Preconditioner
Discussion with Recent Work
Main Experimental Results
Experimental Setup
Muon
RMNP
AdamW
GPT-2 Pre-Training on OpenWebText
...and 28 more sections

Key Result

Theorem 5.5

Under Assumptions \ref{assump:lipschitz}(a), \ref{assump:unbiased}, \ref{assump:variance}, and \ref{assump:lower}, if Algorithm \ref{algoRMNP} uses constant $\eta_t = \eta$ and momentum $\beta \in [0,1)$, then

Figures (14)

Figure 1: Muon \cite{jordan2024muon}
Figure 2: RMNP
Figure 3: Time overhead comparison. The figure shows the wall-clock time of 100 computation steps of the preconditioning process for RMNP versus Muon.
Figure 4: Comparison of convergence results. $L_F$ and $L_\ast$ denote the corresponding smoothness coefficients, and $\|\nabla f\|_F$ and $\|\nabla f\|_\ast$ denote the corresponding convergence criteria.
Figure 5: Comparison among the Transformer layerwise Hessian, the preconditioner for Muon, and the preconditioner for RMNP. The depiction of the Transformer layerwise Hessian is conceptual; real instances can be found in \cite{zhang2024transformers, dong2025towards}. Here $P=(V_tV_t^T)^{\frac{1}{2}}$, and $m$ and $n$ are the number of rows and columns of the weight matrix, respectively. In Section \ref{subsec:diagonal_dominance} we further verify experimentally that the Muon preconditioner exhibits a degree of diagonal dominance.
...and 9 more figures
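
The relation between the preconditioner $P=(V_tV_t^T)^{\frac{1}{2}}$ in the Figure 5 caption and row-wise normalization can be illustrated with a short worked step. This is a sketch under stated assumptions, not a derivation taken from the paper: it assumes $V_t \in \mathbb{R}^{m\times n}$ has full row rank with thin SVD $V_t = U\Sigma Q^T$, and the symbols $U$, $\Sigma$, $Q$, $v_i$, and $D$ are introduced here only for illustration.

```latex
% Full preconditioner: applying P^{-1} yields the orthogonalized direction.
P^{-1} V_t = (V_t V_t^T)^{-\frac{1}{2}} V_t
           = U \Sigma^{-1} U^T \, U \Sigma Q^T
           = U Q^T .

% Diagonal approximation: keep only the diagonal of V_t V_t^T,
% whose i-th entry is the squared l2 norm of the i-th row v_i of V_t.
D = \mathrm{diag}\!\left( \|v_1\|_2^2, \dots, \|v_m\|_2^2 \right),
\qquad
\left( D^{-\frac{1}{2}} V_t \right)_{i,:} = \frac{v_i}{\|v_i\|_2} .
```

The first line is the orthogonalized update that Newton-Schulz iteration approximates; the second shows that, if $V_tV_t^T$ is close to diagonal, dividing each row by its $\ell_2$ norm gives approximately the same direction.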

Theorems & Definitions (31)

Theorem 5.5: $\|\cdot\|_F$-Lipschitz
Remark 5.6: Complexity for Theorem \ref{thm:fro-convergence}
Theorem 5.7: $\|\cdot\|_{1,2}$-Convergence under $\|\cdot\|_F$-Lipschitz
Remark 5.8: Complexity for Theorem \ref{thm:12-fro-convergence}
Theorem 5.9: $\|\cdot\|_{1,2}$-Lipschitz
Remark 5.10: Complexity for Theorem \ref{thm:inf2-convergence}
Lemma A.1
proof
Lemma A.2
proof
...and 21 more
