Handbook of Convergence Theorems for (Stochastic) Gradient Methods

Guillaume Garrigos, Robert M. Gower

TL;DR

This handbook compiles concise, self-contained convergence proofs for gradient-based methods in smooth, convex, strongly convex, and Polyak-Łojasiewicz (PŁ) settings. It covers stochastic variants (SGD, minibatch SGD, momentum, subgradient, and SPS) and nonsmooth tools (proximal methods, subdifferentials). It uses expected smoothness, variance transfer, and interpolation to bound stochastic behavior in a unified way, and gives explicit iteration and complexity rates under several step-size regimes. Proximal and nonsmooth analysis and momentum techniques link the deterministic and stochastic theory, and the proofs come with bibliographic pointers to foundational sources. The result is a modular reference for proving convergence under common optimization structures and for understanding how properties such as interpolation and the PŁ condition affect rates.

Abstract

This is a handbook of simple proofs of the convergence of gradient and stochastic gradient descent type methods. We consider functions that are Lipschitz, smooth, convex, strongly convex, and/or Polyak-Łojasiewicz. Our focus is on "good proofs" that are also simple. Each section can be consulted separately. We start with proofs for gradient descent, then move on to stochastic variants, including minibatching and momentum. We then turn to nonsmooth problems with the subgradient method, proximal gradient descent, and their stochastic variants. Our focus is on global convergence rates and complexity rates. Some slightly less common proofs found here include those of SGD (stochastic gradient descent) with a proximal step, with momentum, and with mini-batching without replacement.

Paper Structure (68 sections, 99 theorems, 380 equations, 1 figure, 1 table, 10 algorithms)

Key Result

lemma 6

Let $\mathcal{F} : \mathbb{R}^d \to \mathbb{R}^p$ be differentiable, and $L>0$. Then $\mathcal{F}$ is $L$-Lipschitz if and only if
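The statement of the lemma is cut off in this excerpt. As a minimal numerical sketch, assuming the standard characterization (that $\mathcal{F}$ is $L$-Lipschitz exactly when the operator norm of its Jacobian satisfies $\|D\mathcal{F}(x)\| \leq L$ for every $x$), one can compare sampled difference quotients with sampled Jacobian norms. The map `F`, the sampling box $[-5, 5]^2$, and all helper names below are hypothetical choices made for illustration, not taken from the paper.

```python
import math
import random


def F(x):
    """Hypothetical map F: R^2 -> R^2, chosen only for illustration."""
    return (math.sin(x[0]), math.cos(x[1]))


def jacobian_norm(x):
    """Operator norm of DF(x) = diag(cos(x1), -sin(x2)).

    For a diagonal matrix this is the largest absolute diagonal entry.
    """
    return max(abs(math.cos(x[0])), abs(math.sin(x[1])))


def dist(u, v):
    """Euclidean distance between two points of R^2."""
    return math.hypot(u[0] - v[0], u[1] - v[1])


def sample_check(num_pairs=10_000, seed=0):
    """Return the largest difference quotient and largest Jacobian norm sampled."""
    rng = random.Random(seed)
    worst_quotient = 0.0
    worst_jacobian = 0.0
    for _ in range(num_pairs):
        x = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
        y = (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))
        gap = dist(x, y)
        if gap == 0.0:
            continue  # skip degenerate pairs to avoid dividing by zero
        worst_quotient = max(worst_quotient, dist(F(x), F(y)) / gap)
        worst_jacobian = max(worst_jacobian, jacobian_norm(x), jacobian_norm(y))
    return worst_quotient, worst_jacobian


L = 1.0  # candidate Lipschitz constant for this particular F
worst_quotient, worst_jacobian = sample_check()
print(f"largest ||F(x) - F(y)|| / ||x - y|| sampled: {worst_quotient:.4f}")
print(f"largest ||DF(x)|| sampled: {worst_jacobian:.4f}")
```

Both sampled quantities should stay at or below `L = 1`. Sampling can only support the bound, not prove it; the proof is what the lemma provides.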

Figures (1)

  • Figure 1: Graph of a PŁ function $f: \mathbb{R}^2 \to \mathbb{R}$. Note that the function is not convex, but that the only critical points are the global minimizers (displayed as a white curve).
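The function plotted in Figure 1 is not given in this excerpt, so the sketch below swaps in a different, one-dimensional example often used to illustrate the PŁ condition: $f(x) = x^2 + 3\sin^2(x)$, with minimum value $0$ at $x = 0$. It checks on a grid that the second derivative is negative somewhere (so $f$ is not convex), that the PŁ inequality $\tfrac12 |f'(x)|^2 \geq \mu\,(f(x) - \inf f)$ holds with the assumed constant $\mu = 1/32$, and that gradient descent with step size $1/L$ still reaches the global minimum. The constant `MU`, the grid, the starting point, and the iteration count are assumptions made for illustration.

```python
import math


def f(x):
    """Nonconvex test function with global minimum f(0) = 0."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    """Derivative of f: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def second_derivative(x):
    """Second derivative of f: 2 + 6 cos(2x), which lies in [-4, 8]."""
    return 2.0 + 6.0 * math.cos(2.0 * x)


MU = 1.0 / 32.0   # assumed PL constant, only checked numerically below
L_SMOOTH = 8.0    # smoothness constant: sup |f''| = 2 + 6
F_STAR = 0.0      # minimum value of f


def min_pl_ratio(lo=-10.0, hi=10.0, n=4001):
    """Smallest value of 0.5 * f'(x)^2 / (f(x) - F_STAR) over a uniform grid."""
    best = math.inf
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        gap = f(x) - F_STAR
        if gap > 0.0:  # the ratio is undefined at the minimizer itself
            best = min(best, 0.5 * grad(x) ** 2 / gap)
    return best


def gradient_descent(x0, steps=200):
    """Plain gradient descent with constant step size 1 / L_SMOOTH."""
    x = x0
    for _ in range(steps):
        x -= grad(x) / L_SMOOTH
    return x


curvature_at_half_pi = second_derivative(math.pi / 2)  # negative, so f is not convex
pl_ratio = min_pl_ratio()
x_final = gradient_descent(3.0)

print(f"f''(pi/2) = {curvature_at_half_pi:.1f}")
print(f"grid minimum of PL ratio = {pl_ratio:.4f} (assumed mu = {MU:.5f})")
print(f"f(x_final) = {f(x_final):.3e}")
```

A grid check is evidence rather than proof of the PŁ inequality. Unlike the function in Figure 1, this example has a single global minimizer instead of a curve of minimizers.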

Theorems & Definitions (241)

  • definition 1: Jacobian
  • remark 2: Gradient
  • definition 3: Hessian
  • remark 4: Hessian and eigenvalues
  • definition 5
  • lemma 6
  • proof
  • definition 7
  • lemma 8
  • proof
  • ...and 231 more