Unified Optimal Analysis of the (Stochastic) Gradient Method
Sebastian U. Stich
TL;DR
The note develops a unified analysis of stochastic gradient methods under relaxed assumptions, namely $(L,\sigma)$-smoothness and $\mu$-convexity. It derives a fundamental recursion for SGD and solves it to obtain convergence bounds that combine an exponentially decaying term with a variance-driven term, for both averaged and last-iterate measures. The key result is that, with appropriate step-size schedules and averaging, SGD attains the rate $O\left( L R^2 \exp\left(-\frac{\mu T}{4L}\right) + \frac{\sigma^2}{\mu T} \right)$ for function suboptimality and similar optimal rates for the distance to the optimum, recovering linear convergence in the deterministic and interpolation cases ($\sigma^2 = 0$). The analysis unifies the treatment of GD and SGD, matches the best known iteration complexities up to constants, and clarifies how averaging and step-size design influence convergence under stochastic noise.
Abstract
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $\mu$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that for carefully chosen stepsizes SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{\mu}{4L}T\bigr] + \frac{\sigma^2}{\mu T} \right)$, where $\sigma^2$ measures the variance of the stochastic noise. For deterministic gradient descent (GD) and for SGD in the interpolation setting we have $\sigma^2 = 0$, and we recover the exponential convergence rate. The bound matches the best known iteration complexity of GD and SGD, up to constants.
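To make the two terms of the bound concrete, the following is a minimal numerical sketch, not the note's algorithm or step-size schedule. It assumes a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ with $a_i \in [\mu, L]$ (so $x^\star = 0$ and $f^\star = 0$), additive Gaussian gradient noise with second moment $\sigma^2$, a constant stepsize $1/(2L)$, and $R^2$ read as the squared initial distance to the optimum. The names `sgd_quadratic` and `rate_bound` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def sgd_quadratic(T, L=10.0, mu=1.0, sigma=0.0, dim=5, seed=0):
    """Run T SGD steps on f(x) = 0.5 * sum(a_i * x_i^2) and return f(x_T).

    Assumptions for illustration only: diagonal quadratic, additive Gaussian
    noise with E||noise||^2 = sigma^2, constant stepsize 1/(2L).
    """
    rng = np.random.default_rng(seed)
    a = np.linspace(mu, L, dim)       # curvatures between mu and L
    x = np.ones(dim)                  # x_0, so R^2 = ||x_0 - x*||^2 = dim
    gamma = 1.0 / (2.0 * L)           # constant stepsize (not the note's schedule)
    for _ in range(T):
        grad = a * x                  # exact gradient of f
        noise = sigma * rng.standard_normal(dim) / np.sqrt(dim)
        x = x - gamma * (grad + noise)
    return 0.5 * np.sum(a * x**2)     # suboptimality f(x_T) - f*

def rate_bound(T, L, mu, sigma, R2):
    """Shape of the bound from the abstract, with the hidden constant set to 1."""
    return L * R2 * np.exp(-mu * T / (4.0 * L)) + sigma**2 / (mu * T)

if __name__ == "__main__":
    for T in (50, 100, 200, 400):
        det = sgd_quadratic(T, sigma=0.0)
        noisy = sgd_quadratic(T, sigma=1.0)
        print(T, det, noisy, rate_bound(T, 10.0, 1.0, 1.0, 5.0))
```

With `sigma=0.0` the suboptimality keeps shrinking geometrically in `T`, which corresponds to the exponential term. With `sigma=1.0` and a constant stepsize the error stalls at a noise floor, which is why the $\frac{\sigma^2}{\mu T}$ term requires the carefully chosen stepsizes mentioned in the abstract rather than the fixed one used in this sketch.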
