Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates

Dmitry Kovalev, Ekaterina Borodich

TL;DR

The paper studies why non-Euclidean SGD variants such as SignSGD and Muon perform well in training deep networks by developing a unified convergence analysis under structured non-Euclidean smoothness and gradient noise. The approach uses a subspace $\mathcal{H}$ to define the non-Euclidean norms $\mathcal{R}$ and $\mathcal{R}_*$, and analyzes momentum, extrapolation, and momentum variance reduction within a trust-region gradient framework, showing that exploiting structure lets these methods match or exceed Euclidean SGD rates. The results show that non-Euclidean SGD can exploit sparsity or low-rank structure in the operators $(\mathbf{L},\mathbf{M},\mathbf{T},\mathbf{\Sigma})$, and that under convexity the rates can match those of AdaGrad-Norm and Shampoo. Overall, the work provides a theoretical justification for the practical success of memory-efficient non-Euclidean SGD and offers guidance for selecting structure-aware preconditioners in large-scale training.

Abstract

Recently, several instances of non-Euclidean SGD, including SignSGD, Lion, and Muon, have attracted significant interest from the optimization community due to their practical success in training deep neural networks. Consequently, a number of works have attempted to explain this success by developing theoretical convergence analyses. Unfortunately, these results cannot properly justify the superior performance of these methods, as they could not beat the convergence rate of vanilla Euclidean SGD. We resolve this important open problem by developing a new unified convergence analysis under the structured smoothness and gradient noise assumption. In particular, our results indicate that non-Euclidean SGD (i) can exploit the sparsity or low-rank structure of the upper bounds on the Hessian and gradient noise, (ii) can provably benefit from popular algorithmic tools such as extrapolation or momentum variance reduction, and (iii) can match the state-of-the-art convergence rates of adaptive and more complex optimization algorithms such as AdaGrad and Shampoo.
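To make the kind of method discussed in the abstract concrete, here is a minimal sketch of SignSGD with momentum, read as a trust-region step: moving by `step * sign(m)` minimizes the linear model $\langle m, x' - x\rangle$ over an $\ell_\infty$ ball of radius `step` around the current point. The toy quadratic objective, the noise level, the exponential-moving-average form of the momentum, and every name and constant below are assumptions chosen for illustration; they are not taken from the paper, whose algorithm and parametrization may differ.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_sgd_momentum(grad_fn, x0, step=0.01, beta=0.9, iters=500, seed=0):
    """SignSGD with an exponential-moving-average momentum buffer (illustrative)."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(iters):
        g = grad_fn(x, rng)
        # Momentum: running average of stochastic gradients.
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
        # Trust-region step in the l_inf ball of radius `step`:
        # every coordinate moves by exactly `step` against the sign of m.
        x = [xi - step * sign(mi) for xi, mi in zip(x, m)]
    return x


# Hypothetical toy problem: f(x) = 0.5 * sum(a_i * x_i^2) with badly scaled curvature.
CURVATURE = [1.0, 10.0, 100.0]


def noisy_grad(x, rng):
    """Exact gradient a_i * x_i plus Gaussian noise (assumed noise model)."""
    return [a * xi + rng.gauss(0.0, 0.1) for a, xi in zip(CURVATURE, x)]


def loss(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURE, x))


x0 = [1.0, -1.0, 0.5]
x_final = sign_sgd_momentum(noisy_grad, x0)
print("initial loss:", loss(x0))
print("final loss:  ", loss(x_final))
```

Because each coordinate moves by the same amount regardless of the gradient's magnitude, the step size does not need to be tuned to the largest curvature; this is one informal way to see why the choice of norm matters on badly scaled problems.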

Paper Structure

This paper contains 33 sections, 16 theorems, 68 equations, 3 tables, 1 algorithm.

Key Result

lemma 1

$\mathcal{R}(\cdot)$ and $\mathcal{R}_*(\cdot)$ are norms, that is, they are subadditive, absolutely homogeneous, and positive-definite functions. Moreover, these norms are dual to each other.
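The paper's norms $\mathcal{R}$ and $\mathcal{R}_*$ are defined through the subspace $\mathcal{H}$, and that definition is not reproduced here. As a stand-in, the sketch below checks the standard notion of duality, $\|g\|_* = \sup_{\|x\| \le 1} \langle g, x \rangle$, on the familiar pair $\ell_\infty$ and $\ell_1$. The choice of this pair, the helper names, and the random test vector are all assumptions made for illustration.

```python
import random


def linf(x):
    """l_inf norm: largest absolute coordinate."""
    return max(abs(v) for v in x)


def l1(x):
    """l_1 norm: sum of absolute coordinates."""
    return sum(abs(v) for v in x)


def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def dual_maximizer_linf(g):
    """A point of the unit l_inf ball that maximizes <g, x>: the sign pattern of g."""
    return [1.0 if v >= 0 else -1.0 for v in g]


rng = random.Random(0)
g = [rng.uniform(-1.0, 1.0) for _ in range(5)]

# The supremum over the unit l_inf ball is attained at sign(g) and equals ||g||_1.
x_star = dual_maximizer_linf(g)
attained = inner(g, x_star)

# Sanity check: randomly sampled points of the unit l_inf ball never do better.
samples = [[rng.uniform(-1.0, 1.0) for _ in g] for _ in range(1000)]
best_sampled = max(inner(g, x) for x in samples)

print("l1(g)        =", l1(g))
print("<g, sign(g)> =", attained)
print("best sampled =", best_sampled)
```

The maximizer `sign(g)` is the same direction a sign-based update follows, which is the usual link between a norm, its dual, and the corresponding steepest-descent step.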

Theorems & Definitions (16)

  • lemma 1
  • lemma 2
  • lemma 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7
  • lemma 8
  • lemma 9
  • lemma 10
  • ...and 6 more