Almost sure convergence rates of stochastic gradient methods under gradient domination

Simon Weissmann; Sara Klein; Waïss Azizian; Leif Döring

TL;DR

This work analyzes almost-sure convergence rates for stochastic gradient methods under gradient domination, replacing strong convexity with PL-type conditions. By combining $L$-smoothness, an ABC-type second-moment bound, and Robbins–Siegmund-type super-martingale arguments, the authors establish last-iterate rates $f(X_n)-f^* \in o\left(n^{-\frac{1}{4\beta-1}+\epsilon}\right)$ for both SGD and SHB under global gradient domination with $\beta\in[\tfrac12,1]$, with rates closely matching expectation-based bounds. The study extends to local gradient domination, showing high-probability containment in dominated regions and giving almost-sure and in-expectation rates conditioned on these events; applications include neural network training with analytic activations and policy-gradient methods in RL. Overall, the results provide a unified, rate-aware framework for stochastic first-order methods in nonconvex settings where gradient domination is a realistic structural assumption, and they yield practical guidance for step-size design in NN and RL contexts.

Abstract

Stochastic gradient methods are among the most important algorithms for training machine learning models. While classical assumptions such as strong convexity allow a simple analysis, they are rarely satisfied in applications. In recent years, global and local gradient domination properties have been shown to be a more realistic replacement for strong convexity. They have been proved to hold in diverse settings, such as (simple) policy gradient methods in reinforcement learning and the training of deep neural networks with analytic activation functions. We prove almost sure convergence rates $f(X_n)-f^*\in o\big( n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent (with and without momentum) under global and local $\beta$-gradient domination assumptions. The almost sure rates get arbitrarily close to recent rates in expectation. Finally, we demonstrate how to apply our results to the training task in both supervised and reinforcement learning.
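
To make the rate concrete, the exponent $\frac{1}{4\beta-1}$ can be tabulated over the admissible range $\beta\in[\tfrac12,1]$. The helper below is only an illustrative sketch; the name `rate_exponent` is hypothetical and does not come from the paper.

```python
def rate_exponent(beta):
    """Return 1 / (4*beta - 1), the exponent in the rate o(n^(-1/(4*beta-1) + eps))."""
    if not 0.5 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [1/2, 1]")
    return 1.0 / (4.0 * beta - 1.0)


# Exponents for the beta values used in the paper's Figure 1.
exponents = {beta: rate_exponent(beta) for beta in (0.5, 0.67, 0.83, 0.92)}
```

At $\beta=\tfrac12$ (the PL case) the exponent equals $1$, and it decreases to $\tfrac13$ as $\beta$ approaches $1$, so weaker gradient domination yields a slower guaranteed rate.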

Paper Structure (19 sections, 21 theorems, 165 equations, 1 figure, 1 table)

Introduction
Literature Review and Classification of our Contribution
Mathematical Background - Optimization under Gradient Domination
Assumptions on the Stochastic First Order Oracle
Stochastic Gradient Methods
Preliminary Discussion on Super-Martingale Convergence Rates
Convergence for Global Gradient Domination Property
Convergence for Local Gradient Domination Property
Application in the training of neural networks
Application in Reinforcement Learning
Acknowledgements
Auxiliary Convergence Theorems
Numerical experiment - Toy example
Details of the implementation:
Proof of Lemma \ref{lem:as_rate_extension_beta_allgemein}
...and 4 more sections

Key Result

Lemma 3.1

Let $(Y_n)_{n\in\mathbb{N}}$ be a sequence of non-negative random variables on an underlying probability space $(\Omega,\mathcal{F},\mathbb{P})$ with natural filtration $(\mathcal{F}_n)_{n\in\mathbb{N}}$ and suppose there exists $\beta \in [\frac{1}{2},1]$, $c_1,c_3\geq0$ and $c_2>0$ such that for all $n\ge1$, where $\gamma_n = \Theta(\frac{1}{n^{\theta}})$ for some fixed $\theta \in \left(\frac{

Figures (1)

Figure 1: Pathwise error $(f_p(X_n))_{n=1,\dots,N}$ of SGD and SHB for various choices of $\beta\in\{0.5, 0.67, 0.83, 0.92\}$. For each setting we have simulated $100$ runs of length $N=10^5$. The bold lines correspond to the average error of SGD (red) and SHB (blue), and the black dash-dotted line corresponds to the theoretical rate $n^{-\frac{1}{4\beta-1}}$.
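
An experiment of this kind can be sketched in a few lines. The code below is not the paper's implementation: it assumes the objective $f_p(x)=|x|^p$ (which satisfies $|f_p'(x)| = p\,f_p(x)^{(p-1)/p}$, i.e. gradient domination with $\beta = 1-\tfrac1p$), additive Gaussian gradient noise, and step sizes $\gamma_n = \text{scale}\cdot n^{-\theta}$. The function name `sgd_toy` and all default parameter values are hypothetical choices made for illustration, and only plain SGD (no momentum) is shown.

```python
import random


def sgd_toy(p, n_steps, theta=0.75, scale=0.1, noise_std=0.1, x0=1.0, seed=0):
    """Run SGD on the assumed toy objective f_p(x) = |x|**p; return f_p along the path."""
    rng = random.Random(seed)
    x = x0
    errors = [abs(x) ** p]
    for n in range(1, n_steps + 1):
        gamma = scale / n ** theta  # step size gamma_n = Theta(n^{-theta})
        # Exact gradient of |x|^p is p * |x|^(p-1) * sign(x).
        sign = (x > 0) - (x < 0)
        grad = p * abs(x) ** (p - 1) * sign
        # Assumed noise model: unbiased gradient plus Gaussian perturbation.
        noisy_grad = grad + noise_std * rng.gauss(0.0, 1.0)
        x -= gamma * noisy_grad
        errors.append(abs(x) ** p)
    return errors


# Example: p = 2 corresponds to beta = 1/2 under the assumption above.
errors = sgd_toy(p=2, n_steps=2000)
```

Averaging such paths over several seeds and plotting them on a log-log scale against $n^{-\frac{1}{4\beta-1}}$ gives a comparison in the spirit of the figure, under the stated assumptions.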

Theorems & Definitions (47)

Definition 2.2
Remark 2.3
Example 2.5: Expected risk minimization
Lemma 3.1
Theorem 4.1
proof
Theorem 4.2
proof : Proof of \ref{thm:SHB-global}
Theorem 5.1
proof : Sketch of proof
...and 37 more
