Unified Optimal Analysis of the (Stochastic) Gradient Method

Sebastian U. Stich

TL;DR

The note develops a unified analysis of stochastic gradient methods for $\mu$-convex functions under a relaxed $(L,\sigma)$-smoothness assumption. It derives a fundamental recursion for SGD and solves it to obtain convergence bounds that capture both an exponentially decaying term and a variance-driven term, for averaged as well as last-iterate measures. The key result is that, with appropriate step-size schedules and averaging, SGD attains the rate $O\left( L R^2 \exp\left(-\frac{\mu T}{4L}\right) + \frac{\sigma^2}{\mu T} \right)$ for function suboptimality, with similar rates for the distance to the optimum, and recovers linear convergence in the deterministic and interpolation cases. This unifies the analyses of GD and SGD, matches the best known iteration complexities up to constants, and clarifies how averaging and step-size design influence convergence under stochastic noise.
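As a rough illustration of how a rate of this form translates into an iteration count, the sketch below drops the constant hidden in the $O(\cdot)$ and introduces a hypothetical target accuracy $\varepsilon > 0$ (a symbol not used in the note's summary). Requiring each of the two terms to be at most $\varepsilon/2$ gives

$$
L R^2 \exp\left(-\frac{\mu T}{4L}\right) \le \frac{\varepsilon}{2} \iff T \ge \frac{4L}{\mu} \ln\frac{2 L R^2}{\varepsilon},
\qquad
\frac{\sigma^2}{\mu T} \le \frac{\varepsilon}{2} \iff T \ge \frac{2\sigma^2}{\mu \varepsilon}.
$$

Under these assumptions, $T = O\left( \frac{L}{\mu} \log\frac{L R^2}{\varepsilon} + \frac{\sigma^2}{\mu \varepsilon} \right)$ iterations suffice, and for $\sigma^2 = 0$ only the logarithmic term remains.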

Abstract

In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp\bigl[-\frac{\mu}{4L}T\bigr] + \frac{\sigma^2}{\mu T} \right)$, where $\sigma^2$ measures the variance in the stochastic noise. For deterministic gradient descent (GD) and SGD in the interpolation setting we have $\sigma^2 = 0$ and we recover the exponential convergence rate. The bound matches the best known iteration complexity of GD and SGD, up to constants.
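The two terms of the stated rate can be explored numerically. The following is a minimal sketch, not the note's method: the diagonal quadratic test function, the constant stepsize $1/L$, and the names `rate_bound`, `gradient_method`, and `f` are all assumptions made for illustration, and the constant hidden in the $O(\cdot)$ is dropped. It evaluates the bound expression and runs the noise-free case ($\sigma^2 = 0$), where the suboptimality should fall below the exponential term.

```python
import math
import random


def rate_bound(L, mu, R2, sigma2, T):
    """Evaluate L*R^2*exp(-mu*T/(4L)) + sigma^2/(mu*T), with the O(.) constant dropped."""
    return L * R2 * math.exp(-mu * T / (4.0 * L)) + sigma2 / (mu * T)


def f(curv, x):
    """Hypothetical test function f(x) = 0.5 * sum_i curv[i] * x[i]^2, minimized at x = 0."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(curv, x))


def gradient_method(curv, x0, T, sigma, stepsize, seed=0):
    """Run T (stochastic) gradient steps on f with additive Gaussian gradient noise.

    With sigma = 0 this is plain gradient descent.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(T):
        x = [xi - stepsize * (a * xi + rng.gauss(0.0, sigma)) for a, xi in zip(curv, x)]
    return x


# Assumed problem data: curvatures give mu = 0.1 and L = 1.0.
curv = [0.1, 0.5, 1.0]
mu, L = min(curv), max(curv)
x0 = [1.0, -2.0, 3.0]
R2 = sum(xi * xi for xi in x0)  # squared initial distance to the optimum x* = 0
T = 200

# Deterministic case (sigma = 0) with constant stepsize 1/L, chosen for illustration only.
x_gd = gradient_method(curv, x0, T, sigma=0.0, stepsize=1.0 / L)
gap_gd = f(curv, x_gd)
bound_gd = rate_bound(L, mu, R2, 0.0, T)

# With noise, the variance term sigma^2/(mu*T) dominates the bound for large T.
bound_noisy = rate_bound(L, mu, R2, 1.0, T)

print(f"GD suboptimality after {T} steps: {gap_gd:.3e}")
print(f"exponential term of the bound:   {bound_gd:.3e}")
print(f"bound with sigma^2 = 1:          {bound_noisy:.3e}")
```

A constant stepsize is used here only because the deterministic case does not need a schedule; for $\sigma^2 > 0$ the $\frac{\sigma^2}{\mu T}$ term relies on the carefully chosen stepsizes mentioned above, which this sketch does not reproduce.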

