Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

Zijian Liu; Zhengyuan Zhou

TL;DR

This work delivers a unified analysis framework for the last-iterate convergence of stochastic gradient methods in constrained composite optimization, removing restrictive assumptions such as compact domains and bounded noise. Centered on the Composite Stochastic Mirror Descent (CSMD) algorithm, it provides the first high-probability last-iterate bounds over general domains, non-Euclidean norms, and composite objectives, and extends to smooth and strongly convex settings with adaptive step schedules. The authors further extend the theory to heavy-tailed and sub-Weibull noise, obtaining near-optimal or optimal rates under moment or tail conditions and maintaining a simple, adaptable proof structure. The results unify diverse scenarios under a single analytic umbrella, guiding robust and geometrically flexible SGD-like methods for practical large-scale problems.
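
For orientation, composite stochastic mirror descent is usually written as below. This is the standard textbook form under assumed notation (objective $F=f+h$ over a closed convex domain $\mathcal{X}$, stochastic gradient estimate $\widehat{g}_t$, step size $\eta_t$, and Bregman divergence $D_\psi$ of a mirror map $\psi$); the paper's exact statement and symbols may differ.

$$x_{t+1}=\operatorname*{argmin}_{x\in\mathcal{X}}\left\{ h(x)+\langle \widehat{g}_t,\,x-x_t\rangle+\frac{D_\psi(x,x_t)}{\eta_t}\right\},\qquad D_\psi(x,y)=\psi(x)-\psi(y)-\langle\nabla\psi(y),\,x-y\rangle.$$

With $\psi=\tfrac{1}{2}\left\Vert \cdot\right\Vert _{2}^{2}$ and $h=0$ this reduces to projected SGD. Last-iterate analysis bounds the objective gap at the final iterate rather than at an average of the iterates.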

Abstract

In recent years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted interest because it performs well in practice yet lacks theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works either are limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Beyond this question, many theoretical problems remain open. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results have been developed for smooth optimization. Additionally, the existing results are all limited to non-composite objectives and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the broader setting of composite optimization and non-Euclidean norms. In this work, to address these issues, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, that accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.
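
The setting described in the abstract (composite objective, unconstrained domain, unbounded noise, last iterate returned) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's algorithm or step schedule: the toy objective $f(x)=\tfrac{1}{2}\left\Vert x-c\right\Vert _{2}^{2}$, the regularizer $h(x)=\lambda\left\Vert x\right\Vert _{1}$, the Euclidean mirror map, the $\eta_0/\sqrt{t}$ step size, and the hypothetical function names `soft_threshold` and `composite_sgd_last_iterate`.

```python
import math
import random


def soft_threshold(v, tau):
    """Proximal map of tau * |.| applied to a scalar v."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def composite_sgd_last_iterate(c, lam, sigma, steps, eta0, seed=0):
    """Euclidean composite (proximal) SGD on F(x) = 0.5*||x - c||^2 + lam*||x||_1.

    Gradients of the smooth part are corrupted by Gaussian noise, which is
    unbounded, and the domain is all of R^d (no projection onto a compact set).
    Returns the final iterate, not an average of the iterates.
    """
    rng = random.Random(seed)
    x = [0.0] * len(c)
    for t in range(1, steps + 1):
        eta = eta0 / math.sqrt(t)  # illustrative schedule (assumption)
        # Noisy gradient of f(x) = 0.5*||x - c||^2.
        g = [xi - ci + rng.gauss(0.0, sigma) for xi, ci in zip(x, c)]
        # Gradient step on f, then the proximal step on h = lam*||.||_1.
        x = [soft_threshold(xi - eta * gi, eta * lam) for xi, gi in zip(x, g)]
    return x


c = [1.0, -2.0, 0.1]
lam = 0.5
x_last = composite_sgd_last_iterate(c, lam=lam, sigma=0.1, steps=2000, eta0=0.5)
# For this toy problem the exact minimizer is the soft-thresholding of c.
x_star = [soft_threshold(ci, lam) for ci in c]
print("last iterate:", x_last)
print("minimizer:   ", x_star)
```

In the Euclidean case the mirror-descent update is a gradient step followed by the proximal map of $\eta_t h$, which is why the soft-thresholding step appears. Comparing `x_last` with `x_star` gives a quick empirical check of how close the final iterate gets on this toy instance.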

Paper Structure (39 sections, 35 theorems, 244 equations, 1 algorithm)

Introduction
Our Contributions
Related Work
Preliminaries
Convergence Criterion
Last-Iterate Convergence of Stochastic Gradient Methods
General Convex Functions
Optimal Rates via the Linear Decay Step Size
Strongly Convex Functions
Unified Theoretical Analysis
Last-Iterate Convergence under Heavy-Tailed Noise
Additional Related Work
New Assumption and Some Discussions
General Convex Functions under Heavy-Tailed Noise
Optimal Rate under Heavy-Tailed Noise
...and 24 more sections

Key Result

Lemma 2.1

Given a $\sigma$-algebra $\mathcal{F}$ and a random vector $Z\in\mathbb{R}^{d}$ that is $\mathcal{F}$-measurable, if $\xi\in\mathbb{R}^{d}$ is a random vector satisfying $\mathbb{E}\left[\xi\mid\mathcal{F}\right]=0$ and $\mathbb{E}\left[\exp\left(\lambda\left\Vert \xi\right\Vert _{*}^ ...

Theorems & Definitions (62)

Lemma 2.1
Theorem 3.1
Remark 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Theorem 3.6
Theorem 3.7
Lemma 4.1
Lemma 4.2
...and 52 more
