Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

Zitao Song; Cedar Site Bai; Zhe Zhang; Brian Bullins; David F. Gleich

TL;DR

DeVA is a framework that bridges vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Its variance adaptation term improves blockwise smoothness, which facilitates faster convergence.

Abstract

Adaptive methods like Adam have become the de facto standard for large-scale vector and Euclidean optimization because their coordinate-wise adaptation has a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) have shown the power of treating weight matrices as matrices rather than long vectors. Linking the two is hard: many natural generalizations are not feasible to implement, and the Adam adaptation cannot simply be moved to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces DeVA (Decoupled Variance Adaptation), a framework that bridges vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation
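As a rough illustration of the decoupling idea described in the abstract, the sketch below splits an Adam-style elementwise step m / sqrt(v) into a scale-invariant sign term and a variance-dependent ratio |m| / sqrt(v). This is only one common way to read such a step and is an assumption here, not the paper's DeVA update; the function names and the toy moment values are hypothetical.

```python
import math

def adam_style_step(m, v, eps=0.0):
    """Elementwise m / (sqrt(v) + eps) for two lists of floats."""
    return [mi / (math.sqrt(vi) + eps) for mi, vi in zip(m, v)]

def decoupled_step(m, v, eps=0.0):
    """Same step, written as sign(m) * (|m| / (sqrt(v) + eps)).

    Returns the step together with its two factors:
    - sign:  scale-invariant direction (unchanged if m is multiplied by c > 0)
    - ratio: variance-dependent magnitude of each coordinate
    """
    sign = [math.copysign(1.0, mi) if mi != 0 else 0.0 for mi in m]
    ratio = [abs(mi) / (math.sqrt(vi) + eps) for mi, vi in zip(m, v)]
    step = [s * r for s, r in zip(sign, ratio)]
    return step, sign, ratio

# Hypothetical first-moment (m) and second-moment (v) estimates.
m = [0.5, -2.0, 0.0, 3.0]
v = [1.0, 4.0, 0.25, 36.0]

step, sign, ratio = decoupled_step(m, v)
print(step)   # matches adam_style_step(m, v)
print(sign)   # direction only
print(ratio)  # per-coordinate damping
```

The sign factor is unchanged when m is rescaled by a positive constant, while the ratio lies in [0, 1] whenever v is at least m squared and shrinks as the variance part of v grows.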

Paper Structure (38 sections, 10 theorems, 54 equations, 7 figures, 6 tables, 4 algorithms)


Introduction
Organization.
Preliminaries
Notation.
Steepest Gradient Descent.
Matrix-based Descent.
Our Methods
Connection to Adam
Matrix Extension
Practical Implementation
Analysis
Convergence result for $\text{DeVA}_{\ell_\infty}$
Convergence result for $\text{DeVA}_{S_\infty}$
Experiment
Main Results
...and 23 more sections

Key Result

Theorem 3.1

Let $L \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ be Kronecker factors with eigendecompositions $L = Q_L \Lambda_L Q_L^T$ and $R = Q_R \Lambda_R Q_R^T$. Let $\sigma_i = \sqrt{\lambda_i}$ and $\sigma_j = \sqrt{\mu_j}$ denote the singular values corresponding to the e… where $\widetilde{E} \in \mathbb{R}^{n \times m}$ is the spectral adaptation matrix with entries de…

Figures (7)

Figure 1: NanoGPT pretraining on FineWeb at a uniform $0.001$ learning rate. Compared to non-adaptive Muon, the adaptive method SOAP and our method DeVA$_{S_{\infty}}$ achieve the target validation perplexity using $4.3\%$ and $6.6\%$ fewer tokens, respectively.
Figure 2: Trace quadratic function optimization (median and 25%/75% quantiles over 100 seeds). (a)--(b): Training performance of various optimizers on 9-dimensional trace quadratic problems with homogeneous vs. heterogeneous Hessians (see the appendix for details). $\text{DeVA}_{S_\infty}$ significantly outperforms Muon in the heterogeneous setting. (c)--(d): Dynamics of the weighted dual norm $\|H\|_{1,\Gamma}$ (see Definition 4.1) for adaptive methods. Adaptive methods ($\text{DeVA}_{\ell_\infty}$, $\text{DeVA}_{S_\infty}$) effectively reduce $\|H\|_{1,\Gamma}$, as predicted by the convergence theorems for $\text{DeVA}_{\ell_\infty}$ and $\text{DeVA}_{S_\infty}$, with a more pronounced reduction in the heterogeneous case (c) than in the homogeneous case (d).
Figure 3:
Figure 4: ResNet-20 validation accuracy on CIFAR-10 after 40 epochs. Optimal learning rates cluster near $0.005$ for vector methods (left) and $0.01$ for matrix methods (right), where $\text{DeVA}_{\ell_\infty}$ and SOAP achieve peak performance, respectively.
Figure 5: Batch size sensitivity on NanoGPT (274M). The generalization performance of $\text{DeVA}_{S_\infty}$ remains robust across larger batch sizes. Furthermore, the difference in average runtime between Muon and $\text{DeVA}_{S_\infty}$ narrows at higher batch scales.
...and 2 more figures

Theorems & Definitions (24)

Theorem 3.1: Coordinate-wise $\text{DeVA}_{S_\infty}$
Proposition 3.1
Corollary 3.2
Definition 4.1: $\gamma$-Weighted Dual Norm
Example 4.2: $\ell_\infty$-norm
Example 4.3: $S_\infty$-norm
Example 4.4: Preconditioned matrix seminorm [veprikov2025preconditioned]
Example 4.5: Nuclear rank of $S$ [davis2025spectral]
Remark 4.6
Remark 4.7
...and 14 more
