TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

Peng Cheng; Jiucheng Zang; Qingnan Li; Liheng Ma; Yufei Cui; Yingxue Zhang; Boxing Chen; Ming Jian; Wen Tong

TL;DR

Reintroducing adaptive scaling into Muon-style optimizers improves optimization efficiency but typically exacerbates instability caused by high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, which confines updates to a stable zone.

Abstract

Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust-Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Experiments on vision and language models show that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.
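The orthogonalization step mentioned in the abstract can be illustrated with a minimal sketch. It uses the classical cubic Newton-Schulz iteration, which is an assumption: this summary does not state which NS variant or how many steps the paper uses, and practical Muon implementations often use a tuned higher-order polynomial with only a few steps. The function name `newton_schulz_orthogonalize` and its parameters are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of G (illustrative sketch only)."""
    # Frobenius normalization puts all singular values in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # use the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))

O = newton_schulz_orthogonalize(G)
gram_error = np.linalg.norm(O @ O.T - np.eye(4))  # distance from having orthonormal rows

# Rescaling the input leaves the output essentially unchanged:
# the magnitude of G does not survive orthogonalization.
scaled = newton_schulz_orthogonalize(1000.0 * G)
```

Because the output is the same for `G` and `1000.0 * G`, any magnitude control has to be reintroduced separately, which is the role the abstract assigns to RMS calibration and trust-region clipping.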

Paper Structure (100 sections, 4 theorems, 62 equations, 11 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Diagonal preconditioning and Adam-style optimizers.
Matrix and block-structured preconditioning beyond diagonal adaptivity.
Orthogonalization-based directions and Muon-style updates.
Trust-region magnitude control, clipping, and effective-time averaging.
Methodology
TrasMuon Algorithm
Orthogonalized direction.
Row-wise scaling calibration.
Trust-region clipping.
Schedule-free temporal smoothing.
Convergence Analysis
Damping-only contraction.
RMS calibration.
...and 85 more sections

Key Result

Lemma 3.1

For any $A\in\mathbb{R}^{m\times n}$ and any $c\in[0,1]^n$, $\|A\,\mathrm{diag}(c)\|_F \le \|A\|_F$.
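The lemma holds because $\|A\,\mathrm{diag}(c)\|_F^2 = \sum_j c_j^2 \|a_j\|_2^2 \le \sum_j \|a_j\|_2^2$, where $a_j$ is the $j$-th column of $A$. The sketch below checks this numerically with coefficients produced by a hypothetical energy-ratio rule. The rule itself (median reference energy, threshold `tau`, square-root shrinkage) and the name `energy_ratio_coefficients` are assumptions made for illustration; they are not taken from the paper, which this summary does not reproduce in detail.

```python
import numpy as np

def energy_ratio_coefficients(U, tau=4.0, eps=1e-12):
    """Hypothetical damping rule: shrink columns whose energy exceeds tau times the median."""
    energy = np.sum(U * U, axis=0)               # per-column squared norm
    ratio = energy / (np.median(energy) + eps)   # relative energy ratio per column
    # Coefficients lie in (0, 1]; columns inside the "trust region" keep c_j = 1.
    return np.minimum(1.0, np.sqrt(tau / np.maximum(ratio, eps)))

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
A[:, 2] *= 50.0  # inject a high-energy outlier column

c = energy_ratio_coefficients(A)
damped = A * c  # broadcasting over columns, equivalent to A @ np.diag(c)

norm_before = np.linalg.norm(A)      # Frobenius norm of A
norm_after = np.linalg.norm(damped)  # Frobenius norm of A diag(c)
```

Any vector `c` with entries in $[0,1]$ satisfies the inequality; the energy-ratio rule is just one way to generate such a vector that damps the outlier column more strongly than the others.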

Figures (11)

Figure 1: Early-stage training dynamics for Qwen3-0.6B from scratch (steps 0--350) under a warmup-stable-decay schedule, comparing (a) warmup-enabled runs and (b) warmup-free runs. The curves are smoothed using a time-weighted exponential moving average (EMA) with smoothing factor $0.1$ for better visualization.
Figure 2: ViT-Base training on ImageNet-100. Multi-seed results (mean $\pm$ std over three seeds: 42, 43, 44) for 4 optimizers. Shaded regions denote variability across seeds.
Figure 3: PINN Helmholtz ($k{=}2$) under random ROI sampling shifts. Curves show the mean over seeds, and shaded regions indicate variability across seeds.
Figure 4: Column outlier injection. Loss trajectories in a window around an outlier event. Vertical markers indicate outlier steps.
Figure 5: Closed-loop clipping evidence. Outlier events increase the column-energy ratio in log-scale (top), followed by stronger feature-wise clipping in the applied coefficients (bottom; $c_{\mathrm{used,min}}$).
...and 6 more figures

Theorems & Definitions (7)

Lemma 3.1: Damping-only contraction
proof
Lemma 3.2: Row-wise RMS calibration
proof
Lemma 3.6: Smoothness descent
Theorem 3.7: Expected stationarity for RMS-calibrated, damping-only updates
proof
