TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

Peng Cheng, Jiucheng Zang, Qingnan Li, Liheng Ma, Yufei Cui, Yingxue Zhang, Boxing Chen, Ming Jian, Wen Tong

TL;DR

Reintroducing adaptive scaling into Muon-style optimizers improves optimization efficiency but typically exacerbates instability caused by high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, which confines updates to a stable zone.

Abstract

Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.
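
The claim that orthogonalization discards magnitude information can be illustrated with a minimal sketch. It assumes the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^{\top}X$ applied to a Frobenius-normalized matrix. Muon implementations typically use a tuned polynomial with far fewer steps, and those coefficients are not given in this summary, so the function name, the step count, and the test matrix below are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=40):
    """Approximate the orthogonal polar factor of G (assumed full rank).

    Illustrative classical cubic iteration, not the tuned variant used in
    Muon-style optimizers.
    """
    # Frobenius normalization puts every singular value in (0, 1], which
    # lies inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each step pushes every singular value of X toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))  # hypothetical stand-in for a momentum matrix

Q = newton_schulz_orthogonalize(G)
Q_scaled = newton_schulz_orthogonalize(1000.0 * G)

# Rows of Q are (numerically) orthonormal: Q @ Q.T is close to the identity.
row_gram_error = np.max(np.abs(Q @ Q.T - np.eye(4)))
# Rescaling the input leaves the output essentially unchanged.
scale_gap = np.max(np.abs(Q - Q_scaled))
```

Because the output is the same for `G` and `1000.0 * G`, the update carries no information about the size of the input, which is why a separate magnitude-control step such as RMS calibration or clipping has to supply the scale.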

Paper Structure

This paper contains 100 sections, 4 theorems, 62 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Lemma 3.1

For any $A\in\mathbb{R}^{m\times n}$ and any $c\in[0,1]^n$, $\|A\,\mathrm{diag}(c)\|_F \le \|A\|_F$.
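
The inequality follows column by column: writing $a_j$ for the $j$-th column of $A$, we have $\|A\,\mathrm{diag}(c)\|_F^2 = \sum_j c_j^2\|a_j\|_2^2 \le \sum_j \|a_j\|_2^2 = \|A\|_F^2$ because every $c_j^2 \le 1$. A small numerical check, as a sketch only: the helper name `damp_columns`, the random matrix, and the injected high-energy column are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def damp_columns(A, c):
    """Return A @ diag(c), i.e. scale column j of A by c[j]."""
    c = np.asarray(c, dtype=float)
    if c.shape != (A.shape[1],):
        raise ValueError("c needs one coefficient per column of A")
    if np.any(c < 0.0) or np.any(c > 1.0):
        raise ValueError("every coefficient must lie in [0, 1]")
    # Broadcasting across rows is equivalent to A @ np.diag(c).
    return A * c


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
A[:, 3] *= 50.0  # hypothetical high-energy outlier column
c = rng.uniform(0.0, 1.0, size=7)  # arbitrary damping coefficients in [0, 1]

damped = damp_columns(A, c)

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
norm_before = float(np.linalg.norm(A))
norm_after = float(np.linalg.norm(damped))
```

Here `norm_after` never exceeds `norm_before`, whatever coefficients in $[0,1]$ are chosen, so damping of this form can shrink an update but cannot enlarge it.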

Figures (11)

  • Figure 1: Early-stage training dynamics for Qwen3-0.6B from scratch (steps 0--350) under a warmup-stable-decay schedule, comparing (a) warmup-enabled runs and (b) warmup-free runs. The curves are smoothed using a time-weighted exponential moving average (EMA) with smoothing factor $0.1$ for better visualization.
  • Figure 2: ViT-Base training on ImageNet-100. Multi-seed results (mean $\pm$ std over three seeds: 42, 43, 44) for 4 optimizers. Shaded regions denote variability across seeds.
  • Figure 3: PINN Helmholtz ($k{=}2$) under random ROI sampling shifts. Curves show the mean over seeds, and shaded regions indicate variability across seeds.
  • Figure 4: Column outlier injection. Loss trajectories in a window around an outlier event. Vertical markers indicate outlier steps.
  • Figure 5: Closed-loop clipping evidence. Outlier events increase the column-energy ratio in log-scale (top), followed by stronger feature-wise clipping in the applied coefficients (bottom; $c_{\mathrm{used,min}}$).
  • ...and 6 more figures

Theorems & Definitions (7)

  • Lemma 3.1: Damping-only contraction
  • proof
  • Lemma 3.2: Row-wise RMS calibration
  • proof
  • Lemma 3.6: Smoothness descent
  • Theorem 3.7: Expected stationarity for RMS-calibrated, damping-only updates
  • proof