Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning

Yechen Zhang, Shuhao Xing, Junhao Huang, Kai Lv, Yunhua Zhou, Xipeng Qiu, Qipeng Guo, Kai Chen

TL;DR

This work proposes Mousse, a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning, and formulates it as the solution to a spectral steepest descent problem constrained by an anisotropic trust region.

Abstract

Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for deep neural networks, where the curvature spectrum is known to be heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose Mousse (Muon Optimization Utilizing Shampoo's Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving a roughly 12% reduction in training steps with negligible computational overhead.
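
The idea of orthogonalizing in a whitened coordinate system can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a minimal reading of the abstract under several assumptions: the Kronecker factors are taken to be the Shampoo-style statistics `left_stat` and `right_stat`, the whitening exponent `power=0.25` follows the usual Shampoo convention and may differ from the paper, an exact SVD stands in for Newton-Schulz iteration, and all function names are hypothetical.

```python
import numpy as np

def inv_root_psd(mat, power, eps=1e-8):
    """Return (mat + eps*I)^(-power) for a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, 0.0) + eps  # clamp round-off negatives, then damp
    return (vecs * vals ** (-power)) @ vecs.T

def polar_factor(mat):
    """Orthogonal polar factor U V^T from the thin SVD (stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt

def whitened_spectral_step(momentum, left_stat, right_stat, lr=0.02, power=0.25):
    """Hypothetical update: whiten, orthogonalize, then map back to parameter space."""
    left_w = inv_root_psd(left_stat, power)    # assumed left whitening matrix
    right_w = inv_root_psd(right_stat, power)  # assumed right whitening matrix
    whitened = left_w @ momentum @ right_w     # momentum in whitened coordinates
    direction = polar_factor(whitened)         # spectral steepest-descent direction
    return -lr * (left_w @ direction @ right_w)

# Toy usage with a single random gradient as both momentum and statistics source.
rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))
left_stat = grad @ grad.T    # Shampoo-style left statistic (assumed form)
right_stat = grad.T @ grad   # Shampoo-style right statistic (assumed form)
step = whitened_spectral_step(grad, left_stat, right_stat)

# With identity statistics the sketch reduces to a plain Muon-like orthogonalized step.
step_iso = whitened_spectral_step(grad, np.eye(6), np.eye(4))
```

Under these assumptions, the step solves a linear minimization over the set of updates whose whitened form has spectral norm at most `lr`, which is one way to read the "anisotropic trust region" in the abstract. With identity statistics every singular value of the step equals `lr`, matching the uniform spectral norm the abstract attributes to Muon.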

Paper Structure

This paper contains 43 sections, 17 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Best results of the Muon and Mousse optimizers on 800M models. Mousse needs $\sim$12% fewer training steps than Muon to reach a comparable loss.
  • Figure 2: Overview of the Mousse framework compared to baselines.
  • Figure 3: Validation Loss comparison on FineWeb (20B tokens). We report the final validation loss of Mousse against AdamW, Muon, and SOAP across varying peak learning rates for model sizes ranging from 160M to 800M. Mousse consistently achieves the lowest validation loss across all model scales, demonstrating superior performance and scalability.
  • Figure 4: Mousse's performance gains over Muon across 160M, 240M, 480M, and 800M models.
  • Figure 5: Scalability analysis of validation loss and training efficiency. Solid lines represent validation loss (left y-axis, lower is better), while dashed lines indicate total training time (right y-axis, lower is better) across model sizes ranging from 160M to 800M. Mousse consistently achieves the lowest validation loss across all scales. Crucially, Mousse maintains a training speed nearly identical to the efficient Muon optimizer.
  • ...and 5 more figures