Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization
Dmitry Kovalev
TL;DR
This work builds a theoretical foundation for optimization with gradient orthogonalization by reframing orthogonalized gradient updates as non-Euclidean trust-region methods using matrix spectral and related norms. It introduces a stochastic non-Euclidean trust-region gradient method with momentum that unifies Muon, normalized SGD, and signSGD, and it provides convergence guarantees across non-convex, star-convex, and second-order smooth regimes. The paper also analyzes weight decay and extrapolation variants, yielding improved iteration complexities (notably $1/\varepsilon^{3}$ for star-convex and $1/\varepsilon^{3.5}$ with extrapolation in certain settings) and offers explanations for Muon’s practical advantages over Orthogonal-SGDM and the role of weight decay in large language-model training. Together, these results illuminate when and why gradient orthogonalization helps in deep learning and connect theory with observed empirical benefits.
Abstract
Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust region is defined in terms of the matrix spectral norm. Motivated by this observation, we develop the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case, along with normalized SGD and signSGD with momentum (Cutkosky and Mehta, 2020; Sun et al., 2023). In addition, we prove state-of-the-art convergence results for the proposed algorithm in a range of scenarios, which involve arbitrary non-Euclidean norms, constrained and composite problems, and non-convex, star-convex, first- and second-order smooth functions. Finally, our theoretical findings provide an explanation for several empirical observations, including the superiority of Muon over the Orthogonal-SGDM algorithm of Tuddenham et al. (2022) and the importance of weight decay in the training of large-scale language models.
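The trust-region view in the abstract can be illustrated with a minimal NumPy sketch. It assumes the first-order trust-region step is the minimizer of the linear model <G, D> over a norm ball of radius eta, where G stands for the gradient or a momentum buffer. The function name `trust_region_step`, the `norm` argument, and the exact SVD are illustrative choices, not the paper's notation or the Muon implementation (which, as far as we know, approximates the orthogonalization iteratively). Under this assumption, the spectral-norm ball gives an orthogonalized update, the Frobenius ball gives a normalized update, and the entrywise max-norm ball gives a sign update.

```python
import numpy as np

def trust_region_step(G, eta, norm="spectral"):
    """Return argmin of <G, D> over the ball {D : ||D|| <= eta}.

    Hypothetical helper for illustration only; G is a 2-D array holding
    a gradient (or momentum) matrix and eta is the trust-region radius.
    """
    if norm == "spectral":
        # Spectral-norm ball: the minimizer is -eta * U V^T (orthogonalized G),
        # and the attained value is -eta times the nuclear norm of G.
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return -eta * (U @ Vt)
    if norm == "frobenius":
        # Euclidean ball: normalized-gradient direction.
        g_norm = np.linalg.norm(G)
        if g_norm == 0.0:
            return np.zeros_like(G)
        return -eta * G / g_norm
    if norm == "max":
        # Entrywise max-norm ball: sign-gradient direction.
        return -eta * np.sign(G)
    raise ValueError(f"unknown norm: {norm}")

# Usage: one step of each kind on a random 4x3 "gradient".
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
D_spec = trust_region_step(G, 0.1, "spectral")
D_frob = trust_region_step(G, 0.1, "frobenius")
D_sign = trust_region_step(G, 0.1, "max")
```

In each case the step lies on the boundary of its ball, so its size is set by eta alone and not by the magnitude of G, which is the sense in which these updates behave like trust-region steps.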
