Convergence Rate of the Last Iterate of Stochastic Proximal Algorithms

Kevin Kurian Thomas Vaidyan; Michael P. Friedlander; Ahmet Alacaoglu

TL;DR

This work addresses the lack of theoretical guarantees for last-iterate convergence in proximal stochastic optimization with unbounded gradient variance. By leveraging co-coercivity and a novel last-iterate reduction, it proves a $\widetilde{O}(1/\sqrt{T})$ last-iterate rate for proximal SGD under componentwise smoothness and convexity, and extends the result to randomized incremental proximal methods with a Lipschitz assumption on the component regularizers. It also provides corollaries for projected SGD and stochastic proximal point, and demonstrates the practical viability via BlockProx extensions and numerical experiments showing the last iterate often outperforms averaging. The results apply to graph-guided regularizers common in multi-task and federated learning, offering nonasymptotic, data-independent rates without requiring bounded variance. Overall, the paper advances the understanding of last-iterate behavior in regularized stochastic optimization and broadens the toolkit for large-scale, graph-structured learning tasks.
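To make the proximal SGD setting concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical instance that does not appear in the paper: least-squares components $f_i(x) = \tfrac12 (a_i^\top x - b_i)^2$ with an $\ell_1$ regularizer, whose proximal operator is soft-thresholding. It also assumes that $\tau$ in Theorem 3.1 is the step size and that $L$ is the largest component smoothness constant. The function names `soft_threshold`, `prox_sgd`, and `objective` are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd(A, b, lam, T, seed=0):
    """Proximal SGD on (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + lam*||x||_1.

    Returns the last iterate and the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Assumption: L is the largest component smoothness constant, max_i ||a_i||^2.
    L = np.max(np.sum(A * A, axis=1))
    # Assumption: tau from Theorem 3.1 is used as a constant step size.
    tau = 1.0 / (3.0 * L * np.sqrt(T))
    x = np.zeros(d)
    x_avg = np.zeros(d)
    for k in range(T):
        i = rng.integers(n)                   # sample one component uniformly
        grad = (A[i] @ x - b[i]) * A[i]       # stochastic gradient of f_i at x
        x = soft_threshold(x - tau * grad, tau * lam)  # proximal step on lam*||.||_1
        x_avg += (x - x_avg) / (k + 1)        # running mean of x_1, ..., x_{k+1}
    return x, x_avg

def objective(A, b, lam, x):
    """Full composite objective, used only for evaluation."""
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```

Calling `objective` on both returned vectors gives a quick way to compare the last iterate against the averaged one on a given problem instance.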

Abstract

We analyze two classical algorithms for solving additively composite convex optimization problems, where the objective is the sum of a smooth term and a nonsmooth regularizer: the proximal stochastic gradient method, for a single regularizer, and the randomized incremental proximal method, which uses the proximal operator of a randomly selected function when the regularizer is given as the sum of many nonsmooth functions. We focus on relaxing the bounded variance assumption, which is common yet stringent, for obtaining last-iterate convergence rates. We prove a $\widetilde{O}(1/\sqrt{T})$ rate of convergence for the last iterate of both algorithms under componentwise convexity and smoothness, which is optimal up to log terms. Our results apply directly to graph-guided regularizers that arise in multi-task and federated learning, where the regularizer decomposes as a sum over edges of a collaboration graph.
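The randomized incremental proximal method with a graph-guided regularizer can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm as written: each node $i$ holds a scalar parameter with a hypothetical local loss $f_i(x) = \tfrac12 (x_i - y_i)^2$, each edge $(u, v)$ contributes $\lambda |x_u - x_v|$, components and edges are sampled uniformly, and the step size $\tau = c/\sqrt{T}$ with a free constant `c` is an assumption (the step size of the corresponding theorem is not reproduced in this summary). The names `prox_edge` and `incremental_prox` are illustrative.

```python
import numpy as np

def prox_edge(x, u, v, t):
    """Apply the prox of t*|x[u] - x[v]| in place.

    The two endpoints move toward each other by at most t each,
    and meet at their midpoint if they are closer than 2*t.
    """
    d = x[u] - x[v]
    s = np.sign(d) * min(t, abs(d) / 2.0)
    x[u] -= s
    x[v] += s

def incremental_prox(y, edges, lam, T, c=1.0, seed=0):
    """Randomized incremental proximal sketch on a graph.

    y     : per-node targets defining f_i(x) = 0.5*(x_i - y_i)^2 (hypothetical losses)
    edges : list of (u, v) pairs; each edge carries the term lam*|x_u - x_v|
    c     : assumed step-size constant, tau = c / sqrt(T)
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    tau = c / np.sqrt(T)
    x = np.zeros(n)
    for _ in range(T):
        i = rng.integers(n)
        x[i] -= tau * (x[i] - y[i])            # gradient step on the sampled f_i
        u, v = edges[rng.integers(len(edges))]
        prox_edge(x, u, v, tau * lam)          # prox of one randomly selected edge term
    return x  # last iterate
```

Each iteration touches only one node's loss and one edge, which is what makes this style of update attractive when the regularizer has one term per edge of a large collaboration graph.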

Paper Structure (32 sections, 20 theorems, 160 equations, 1 figure, 1 table)

This paper contains 32 sections, 20 theorems, 160 equations, 1 figure, 1 table.

Introduction
Problem setting and main assumptions
Notation.
Variance control via co-coercivity
Related Work
SGD world
Proximal point world
Proximal SGD
Statement of the result
Proof setup
One-iteration analysis
Last-iterate reduction
Corollary for Projected SGD
Randomized Incremental Proximal Method
Statement of the result
...and 17 more sections

Key Result

Theorem 3.1

Let the assumptions hold. In the proximal SGD update (eq:intro-spgd), let $\tau = 1/(3 L \sqrt{T})$. Then

Figures (1)

Figure 1: Comparison of last and averaged iterates.

Theorems & Definitions (31)

Theorem 3.1: Last-iterate convergence
Lemma 3.2: Per-iteration descent
Lemma 3.3: Last iterate reduction
Corollary 3.4: Projected SGD
Theorem 4.1: Last-iterate convergence, incremental proximal
Lemma 4.2: Per-iteration descent, incremental proximal
Corollary 4.3: Stochastic proximal point
Theorem 4.4: Last-iterate convergence, BlockProx
Theorem A.1: Last-iterate convergence, polynomial step sizes
Corollary A.2: Best polynomial step size
...and 21 more
