A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li

TL;DR

This work establishes a unified theoretical framework connecting adaptive optimizers and normalized steepest descent (NSD) through non-Euclidean geometry. By introducing adaptive smoothness $\Lambda_{\mathcal{H}}(f)$ and adaptive gradient variance $\sigma_{\mathcal{H}}$, it shows that adaptive methods converge in nonconvex settings at a rate governed by these stronger, geometry-aware notions, and that they can achieve accelerated $O(1/T^2)$ rates in convex settings when coupled with Nesterov momentum. A novel matrix-inequality tool handles general well-structured preconditioner sets beyond the diagonal case, yielding dimension-free stochastic rates under adaptive noise and highlighting a fundamental gap relative to standard smoothness and variance. The results place AdaGrad, Adam, and Shampoo-like methods under a single framework and quantify the benefits of adaptive geometry for both deterministic and stochastic optimization on Euclidean and non-Euclidean geometries.

Abstract

Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
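The reduction mentioned in the first sentence of the abstract can be illustrated in the diagonal case. The sketch below is an assumption-laden toy, not the paper's algorithm: it uses a simplified Adam-style update without bias correction, with hypothetical helper names `adam_direction` and `sign_direction`. Setting both decay rates to zero, so that the moment estimates depend only on the current gradient, and taking `eps = 0` makes the update direction equal to the sign of the gradient, which is normalized steepest descent under the $\ell_\infty$ norm.

```python
import math

def adam_direction(g, m, v, beta1, beta2, eps=0.0):
    """Simplified Adam-style direction (no bias correction; illustrative only)."""
    # Exponential moving averages of the gradient and its elementwise square.
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v_new = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    direction = [mi / (math.sqrt(vi) + eps) for mi, vi in zip(m_new, v_new)]
    return direction, m_new, v_new

def sign_direction(g):
    """Sign descent direction: steepest descent direction under the l_inf norm."""
    return [math.copysign(1.0, gi) for gi in g]

# A gradient with no zero coordinates, so eps = 0 is safe here.
g = [0.3, -4.0, 1e-3]

# beta1 = beta2 = 0: the optimizer adapts only to the current gradient.
direction, _, _ = adam_direction(g, m=[0.0] * 3, v=[0.0] * 3, beta1=0.0, beta2=0.0)

print(direction)          # each entry is +1 or -1 up to rounding
print(sign_direction(g))
```

With nonzero decay rates the direction depends on the gradient history and no longer coincides with sign descent, which is where the analyses of the two families begin to differ.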

Paper Structure

This paper contains 34 sections, 42 theorems, 188 equations, 1 figure, 8 algorithms.

Key Result

Lemma 2.2

Let $\mathcal{H}\subseteq\mathcal{S}_+^d$ be any well-structured preconditioner set. Recall that its induced norm is defined as $\|\cdot\|_{\mathcal{H}} = \sup_{\mathbf{H}\in\mathcal{H},\,\operatorname{Tr}(\mathbf{H})\leq 1}\|\cdot\|_{\mathbf{H}}$. Then it holds that

Figures (1)

  • Figure 1: Duality between the supremum of the primal norms and the infimum of the corresponding dual norms for a well-structured preconditioner set $\mathcal{H}$. Here $\mathcal{H}=\{\text{all diagonal PSD matrices}\}$, in which case $\|\cdot\|_{\mathcal{H}}=\|\cdot\|_\infty$ and $\|\cdot\|_{\mathcal{H},*}=\|\cdot\|_1$. Left: the $\|\cdot\|_\infty$-unit ball (black square) in the primal space is the intersection of all $\|\cdot\|_{\mathbf{H}}$-unit balls (colored ellipses) for $\mathbf{H}\in\mathcal{H}$ with $\operatorname{Tr}(\mathbf{H})\leq 1$; that is, $\|\cdot\|_\infty$ is the supremum of all such primal $\|\cdot\|_{\mathbf{H}}$ norms. Right: the $\|\cdot\|_1$-unit ball (dashed black square) in the dual space is the union of all $\|\cdot\|_{\mathbf{H},*}$-unit balls (dashed ellipses) for $\mathbf{H}\in\mathcal{H}$ with $\operatorname{Tr}(\mathbf{H})\leq 1$; that is, $\|\cdot\|_1$ is the infimum of all such dual $\|\cdot\|_{\mathbf{H},*}$ norms.
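The diagonal case described in the Figure 1 caption can be checked numerically. The sketch below assumes the usual conventions, which this page does not spell out: for a diagonal matrix with positive entries $h_i$, the primal norm of $x$ is $\sqrt{\sum_i h_i x_i^2}$ and the dual norm of $g$ is $\sqrt{\sum_i g_i^2 / h_i}$. The helper names and test vectors are hypothetical. Random trace-one diagonal matrices never exceed the $\ell_\infty$ norm on the primal side and never fall below the $\ell_1$ norm on the dual side, and both bounds are attained by explicit choices of $h$.

```python
import math
import random

def primal_norm(x, h):
    """Assumed convention: ||x||_H = sqrt(sum_i h_i * x_i^2) for H = diag(h)."""
    return math.sqrt(sum(hi * xi * xi for hi, xi in zip(h, x)))

def dual_norm(g, h):
    """Assumed convention: ||g||_{H,*} = sqrt(sum_i g_i^2 / h_i), with h_i > 0."""
    return math.sqrt(sum(gi * gi / hi for gi, hi in zip(g, h)))

def random_trace_one_diagonal(d, rng):
    """Random positive diagonal h with sum(h) = 1, i.e. Tr(H) = 1."""
    w = [rng.expovariate(1.0) + 1e-12 for _ in range(d)]
    total = sum(w)
    return [wi / total for wi in w]

rng = random.Random(0)
x = [0.5, -2.0, 1.25]   # primal test vector (hypothetical)
g = [3.0, -1.0, 0.5]    # dual test vector (hypothetical)

linf = max(abs(xi) for xi in x)
l1 = sum(abs(gi) for gi in g)

samples = [random_trace_one_diagonal(3, rng) for _ in range(2000)]
sup_primal = max(primal_norm(x, h) for h in samples)   # approaches linf from below
inf_dual = min(dual_norm(g, h) for h in samples)       # approaches l1 from above

# Explicit extremal choices: all mass on the largest |x_i| for the primal norm,
# and h_i proportional to |g_i| for the dual norm (Cauchy-Schwarz equality case).
e_max = [1.0 if abs(xi) == linf else 0.0 for xi in x]
h_star = [abs(gi) / l1 for gi in g]

print(sup_primal, "<=", linf, "attained:", primal_norm(x, e_max))
print(inf_dual, ">=", l1, "attained:", dual_norm(g, h_star))
```

The extremal $h$ for the dual norm is proportional to $|g_i|$, which is the same scaling a diagonal preconditioner built from the current gradient alone would apply.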

Theorems & Definitions (77)

  • Definition 2.1: Well-structured preconditioner set
  • Lemma 2.2
  • Definition 2.3
  • Definition 2.4: Adaptive smoothness (xie2025structured)
  • Proposition 2.4
  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.2
  • Lemma 3.2
  • Definition 4.1: Standard and adaptive gradient variance
  • ...and 67 more