Almost sure convergence rates of stochastic gradient methods under gradient domination
Simon Weissmann, Sara Klein, Waïss Azizian, Leif Döring
TL;DR
This work analyzes almost-sure convergence rates of stochastic gradient methods under gradient domination, which replaces strong convexity with PL-type conditions. By combining $L$-smoothness, an ABC-type second-moment bound, and Robbins–Siegmund-type supermartingale arguments, the authors establish last-iterate rates $f(X_n)-f^* \in o\left(n^{-\frac{1}{4\beta-1}+\epsilon}\right)$ for both stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB) under global gradient domination with $\beta\in[\tfrac12,1]$; these rates come arbitrarily close to the corresponding bounds in expectation. The study extends to local gradient domination, showing that the iterates remain in the dominated region with high probability and giving almost-sure and in-expectation rates conditioned on these events. Applications include neural network training with analytic activations and policy-gradient methods in reinforcement learning (RL). Overall, the results provide a unified framework with explicit rates for stochastic first-order methods in nonconvex settings where gradient domination is a realistic structural assumption, and they yield practical guidance for step-size design in neural network and RL contexts.
Abstract
Stochastic gradient methods are among the most important algorithms for training machine learning models. While classical assumptions such as strong convexity allow a simple analysis, they are rarely satisfied in applications. In recent years, global and local gradient domination properties have been shown to be a more realistic replacement for strong convexity. They were proved to hold in diverse settings such as (simple) policy gradient methods in reinforcement learning and the training of deep neural networks with analytic activation functions. We prove almost sure convergence rates $f(X_n)-f^*\in o\big( n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent (with and without momentum) under global and local $\beta$-gradient domination assumptions. The almost sure rates get arbitrarily close to recent rates in expectation. Finally, we demonstrate how to apply our results to the training task in both supervised and reinforcement learning.
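To make the rate concrete, the following is a minimal Python sketch, not the authors' experiment. It assumes gradient domination in the commonly used form $\|\nabla f(x)\| \ge c\,(f(x)-f^*)^{\beta}$, so that $\beta=\tfrac12$ is the Polyak-Łojasiewicz (PL) case. The test function $f(x)=x^2+3\sin^2(x)$ is a standard nonconvex PL example with $f^*=0$; the Gaussian gradient noise, the step sizes $\gamma_n = 1/(n+10)$, and the names `rate_exponent` and `run_sgd` are illustrative assumptions rather than choices taken from the paper.

```python
import math
import random


def f(x):
    """Nonconvex test objective with minimum value f* = 0 at x = 0."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x):
    """Exact gradient of f: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def rate_exponent(beta):
    """Exponent 1 / (4*beta - 1) from the rate o(n^(-1/(4*beta-1) + eps))."""
    if not 0.5 <= beta <= 1.0:
        raise ValueError("beta must lie in [1/2, 1]")
    return 1.0 / (4.0 * beta - 1.0)


def run_sgd(num_steps=20000, x0=3.0, noise_std=1.0, seed=0):
    """Run SGD with additive Gaussian gradient noise; return f(X_n) - f*.

    The schedule gamma_n = 1 / (n + 10) is an illustrative assumption,
    not the step-size schedule analyzed in the paper.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, num_steps + 1):
        gamma = 1.0 / (n + 10)
        noisy_grad = grad_f(x) + rng.gauss(0.0, noise_std)
        x -= gamma * noisy_grad
    return f(x)  # f* = 0, so this is the optimality gap of the last iterate


print("exponent for beta = 1/2:", rate_exponent(0.5))
print("exponent for beta = 1:  ", rate_exponent(1.0))
print("last-iterate gap after 20000 steps:", run_sgd())
```

For $\beta=\tfrac12$ the exponent is $1$, so the optimality gap decays almost as fast as $1/n$, while for $\beta=1$ it drops to $\tfrac13$. A single simulated path only illustrates the behavior and does not verify the almost-sure statement.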
