REG: A Regularization Optimizer for Robust Training Dynamics

Zehua Liu, Han Wu, Xiaojin Fu, Shuqi Liu, Xiongwei Han, Tao Zhong, Mingxuan Yuan

TL;DR

REG is a novel optimizer that replaces Muon's aggressive matrix sign operator with the Row-and-Column-Scaling (RACS) operator. It achieves superior performance and stability over AdamW while remaining consistent with the AdamW training paradigm.

Abstract

Optimizers are crucial for the efficient training of Large Language Models (LLMs). While AdamW is the de facto standard, recent structure-aware optimizers like Muon have emerged, which regularize gradient updates by operating on entire weight matrices. The Muon optimizer balances the gradient updates along all directions. However, Muon's reliance on the matrix sign function can lead to training instability and makes it incompatible with fine-tuning models pre-trained with AdamW. To address these limitations, we propose REG, a novel optimizer that replaces Muon's aggressive matrix sign operator with the Row-and-Column-Scaling (RACS) operator. Theoretically grounded in balancing a matrix, the RACS operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. Through extensive empirical experiments on LLM training, we demonstrate that the REG optimizer not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm. This consistency is particularly evident during the fine-tuning stage, where the REG optimizer avoids the performance degradation observed with Muon.
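To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. `matrix_sign` is the standard SVD form of the matrix sign (it sets every singular value to 1). `row_col_scale` is a hypothetical stand-in for a row-and-column scaling step; the use of the 2-norm, a single row pass followed by a single column pass, and the `eps` guard are all assumptions, since the abstract does not define RACS precisely.

```python
import numpy as np

def matrix_sign(G):
    """Matrix sign via SVD: keep the singular vectors, set all singular values to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def row_col_scale(G, eps=1e-8):
    """Hypothetical row-and-column scaling (assumed form, not the paper's RACS)."""
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)  # scale rows to unit 2-norm
    G = G / (np.linalg.norm(G, axis=0, keepdims=True) + eps)  # then columns to unit 2-norm
    return G

rng = np.random.default_rng(0)
# A badly balanced "gradient": row magnitudes span four orders of magnitude.
G = rng.normal(size=(4, 6)) * np.array([[100.0], [1.0], [1.0], [0.01]])

sv_raw = np.linalg.svd(G, compute_uv=False)
sv_sign = np.linalg.svd(matrix_sign(G), compute_uv=False)
sv_scaled = np.linalg.svd(row_col_scale(G), compute_uv=False)

print("condition number, raw:   ", sv_raw.max() / sv_raw.min())
print("condition number, sign:  ", sv_sign.max() / sv_sign.min())
print("condition number, scaled:", sv_scaled.max() / sv_scaled.min())
```

In this sketch the matrix sign flattens the spectrum completely, whereas the scaling step only removes the imbalance between rows and columns and leaves the remaining spectral structure in place. This is one way to read "less drastic" in the abstract.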

Paper Structure

This paper contains 24 sections, 4 theorems, 56 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let $M \in \mathbb{R}^{m \times n}$ be a matrix. In the following, the norm $\| \cdot \|^*$ may be any Hölder norm or the Frobenius norm. (a) If $\kappa (M) := \| M \|_\infty / \| M \|^*$, then $\kappa (DM)$ is minimal if all rows in $DM$ have equal $1$-norm. (b) If $\kappa (M) := \| M \|_1 / \| M \
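The visible part (a) of the theorem can be checked numerically. The sketch below makes three assumptions that the truncated statement does not confirm: $D$ is a positive diagonal matrix acting on rows, $\| \cdot \|_\infty$ is the induced infinity-norm (the maximum row $1$-norm), and $\| \cdot \|^*$ is the Frobenius norm. It compares $\kappa(DM)$ for the row-equilibrating $D$ against many random positive diagonal scalings.

```python
import numpy as np

def kappa(M):
    """Ratio from part (a): max row 1-norm divided by the Frobenius norm."""
    return np.linalg.norm(M, np.inf) / np.linalg.norm(M, "fro")

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))

# Diagonal scaling that gives every row of D M the same 1-norm (here: 1).
d_eq = 1.0 / np.abs(M).sum(axis=1)
M_eq = d_eq[:, None] * M
kappa_eq = kappa(M_eq)

# Best value found among random positive diagonal scalings, for comparison.
kappa_rand = min(
    kappa(rng.uniform(0.1, 10.0, size=M.shape[0])[:, None] * M)
    for _ in range(1000)
)

print("kappa with equal row 1-norms:  ", kappa_eq)
print("best kappa over random scalings:", kappa_rand)
```

Under these assumptions, no random scaling should beat the equilibrated one. Fixing the largest scaled row $1$-norm to 1, the Frobenius norm in the denominator is largest when every row is scaled up to that same $1$-norm.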

Figures (2)

  • Figure 1: Training loss and accuracy curves on the CIFAR-100 image classification task.
  • Figure 2: Training loss curves for AdamW and the regularized optimizer (REG) on the openwebtext-100k dataset.

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Proof 1
  • Theorem 3
  • Theorem 4: Convergence of Row-Normalized Gradient Descent with Momentum
  • Proof 2: Proof of Theorem 3
  • Proof 3: Proof of Theorem 4