Unified Optimal Analysis of the (Stochastic) Gradient Method

Sebastian U. Stich

TL;DR

The note develops a unified analysis of stochastic gradient methods under a relaxed smoothness and convexity framework by introducing an $(L,\sigma)$-smoothness and $\mu$-convexity setting. It derives a fundamental recursion for SGD and solves it to obtain convergence bounds that simultaneously capture exponential decay terms and variance-driven terms, for both averaged and last-iterate measures. The key contribution is a tight, information-efficient rate characterization showing that, with appropriate step-size schedules and averaging, SGD attains rates $O\left( L R^2 \exp\left(-\frac{\mu T}{4L}\right) + \frac{\sigma^2}{\mu T} \right)$ for function suboptimality and similar optimal rates for distance to the optimum, recovering linear convergence in deterministic/interpolation cases. This work unifies analyses of GD and SGD, matches the best known iteration complexities up to constants, and clarifies how averaging and step-size design influence convergence under stochastic noise.

Abstract

In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $μ$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{μ}{4L}T\bigr] + \frac{σ^2}{μT} \right)$ where $σ^2$ measures the variance in the stochastic noise. For deterministic gradient descent (GD) and SGD in the interpolation setting we have $σ^2 =0$ and we recover the exponential convergence rate. The bound matches with the best known iteration complexity of GD and SGD, up to constants.
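The two regimes in this bound can be seen in a small numerical experiment. The following is a minimal sketch, not the note's method: it assumes a diagonal quadratic test function, additive Gaussian gradient noise with $\mathbb{E}\|\text{noise}\|^2 = σ^2$, and a constant stepsize $\gamma = \frac{1}{2L}$ rather than the carefully chosen stepsizes the note analyzes. The function name `sgd_quadratic` and all its parameters are hypothetical.

```python
import random


def sgd_quadratic(mu, L, sigma, T, dim=5, seed=0):
    """Run constant-stepsize SGD on f(x) = 0.5 * sum_i a_i * x_i**2.

    Returns (f(x_0), f(x_T)). The minimizer is x* = 0 with f* = 0, so the
    returned values are also the suboptimality gaps.
    """
    rng = random.Random(seed)
    # Curvatures spread evenly over [mu, L]: f is mu-strongly convex and L-smooth.
    a = [mu + (L - mu) * i / (dim - 1) for i in range(dim)]
    x = [1.0] * dim
    gamma = 1.0 / (2.0 * L)  # assumed constant stepsize, not the note's schedule

    def f(z):
        return 0.5 * sum(ai * zi * zi for ai, zi in zip(a, z))

    f0 = f(x)
    noise_std = sigma / dim ** 0.5  # per-coordinate std so that E||noise||^2 = sigma^2
    for _ in range(T):
        # Stochastic gradient: exact gradient plus zero-mean Gaussian noise.
        g = [ai * xi + rng.gauss(0.0, noise_std) for ai, xi in zip(a, x)]
        x = [xi - gamma * gi for xi, gi in zip(x, g)]
    return f0, f(x)


# sigma = 0 is plain gradient descent; sigma > 0 adds gradient noise.
f0, gap_det = sgd_quadratic(mu=1.0, L=10.0, sigma=0.0, T=200)
_, gap_noisy = sgd_quadratic(mu=1.0, L=10.0, sigma=1.0, T=200)
print(f"initial gap {f0:.3e}, noiseless gap {gap_det:.3e}, noisy gap {gap_noisy:.3e}")
```

With $σ = 0$ the gap shrinks geometrically, which corresponds to the exponential term of the bound. With $σ > 0$ and a constant stepsize the iterates stall at a noise floor, which is why decreasing stepsizes and averaging are needed to obtain a variance term of the form $\frac{σ^2}{μT}$.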

Paper Structure

This paper contains 19 sections, 7 theorems, 35 equations.

Key Result

Lemma 1

For $\mathbf{x}_0 \in \mathbb{R}^d$, let $\{\mathbf{x}_t\}_{t \geq 0}$ denote the iterates of SGD generated on a function $f$ under Assumptions 1-3 for stepsizes $\gamma_t \leq \frac{1}{2L}$, $\forall t \geq 0$. Then

Theorems & Definitions (14)

  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 5
  • proof
  • proof: Proof of Lemma 1
  • Lemma 6
  • proof
  • ...and 4 more