Improved Last-Iterate Convergence of Shuffling Gradient Methods for Nonsmooth Convex Optimization

Zijian Liu, Zhengyuan Zhou

TL;DR

The paper advances the theory of shuffling gradient methods for nonsmooth convex optimization by establishing improved last-iterate convergence rates under Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$). By modeling the index sequence as uniform randomness and introducing a new sufficient last-iterate lemma, it shows that $\textsf{RR}$ and $\textsf{SS}$ can beat Proximal GD in key regimes, and it achieves a nearly optimal suffix-average rate under $\textsf{RR}$ that matches known lower bounds. The results cover general convex and strongly convex settings, with explicit dependence on the problem constants $G_{f,1}$, $G_{f,2}$, $D_{\star}$, and the horizon $T$, and reveal nuanced behavior of the $\textsf{SS}$ scheme depending on the time horizon. These findings underscore the beneficial role of randomness in shuffling methods and close gaps between upper and lower bounds reported in prior work. The work thus provides both theoretical insights and practically relevant guidance for choosing shuffling strategies in large-scale finite-sum optimization with regularization.

Abstract

We study the convergence of the shuffling gradient method, a popular algorithm for minimizing a finite-sum function with regularization, in which the component functions are passed one by one to (Proximal) Gradient Descent (GD) in an order determined by a permutation of their indices. Despite its easy implementation and effective performance in practice, its theoretical understanding remains limited. A recent advance by Liu & Zhou (2024b) establishes the first last-iterate convergence results under various settings, in particular proving the optimal rates for smooth (strongly) convex optimization. However, their bounds for nonsmooth (strongly) convex functions are only as fast as Proximal GD. In this work, we provide the first improved last-iterate analysis for the nonsmooth case, demonstrating that the widely used Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$) strategies are both provably faster than Proximal GD, reflecting the benefit of randomness. As an important implication, we give the first (nearly) optimal convergence result for the suffix average under the $\textsf{RR}$ sampling scheme in the general convex case, matching the lower bound shown by Koren et al. (2022).
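
To make the three sampling schemes concrete, here is a minimal Python sketch of a shuffling proximal subgradient loop on a toy scalar problem. It is an illustration under stated assumptions, not the paper's Algorithm: the objective $\frac{1}{n}\sum_{i}|x-b_i|+\lambda|x|$, the per-step proximal update, the function names `soft_threshold` and `shuffling_prox_subgradient`, and the stepsize passed in by the caller are all hypothetical choices made for this example.

```python
import random


def soft_threshold(v, tau):
    """Proximal operator of tau * |x| for a scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def shuffling_prox_subgradient(b, lam, x1, epochs, stepsize, scheme="RR", seed=0):
    """Toy solver for (1/n) * sum_i |x - b[i]| + lam * |x| (scalar x).

    scheme: "RR" draws a fresh permutation every epoch, "SS" draws one
    permutation and reuses it, "IG" always uses the order 0, 1, ..., n-1.
    Returns the list of iterates [x_1, x_2, ..., x_{Kn+1}] with K = epochs.
    Assumption: the proximal step is applied after every component step.
    """
    rng = random.Random(seed)
    n = len(b)
    order = list(range(n))
    if scheme == "SS":
        rng.shuffle(order)  # shuffled once, then fixed for all epochs
    x = x1
    iterates = [x]
    t = 1
    for _ in range(epochs):
        if scheme == "RR":
            rng.shuffle(order)  # reshuffled at the start of each epoch
        for i in order:
            # A subgradient of |x - b[i]|; 0 is a valid choice at the kink.
            g = (x > b[i]) - (x < b[i])
            eta = stepsize(t)
            x = soft_threshold(x - eta * g, eta * lam)
            iterates.append(x)
            t += 1
    return iterates
```

With `lam = 0.0` the minimizer of the toy objective is the median of `b`, so a call such as `shuffling_prox_subgradient([1.0, 2.0, 3.0, 4.0, 5.0], 0.0, 0.0, 200, lambda t: 1.0 / t ** 0.5)` should end near 3. The $1/\sqrt{t}$ stepsize is only a convenient choice for the demonstration and is not one of the stepsizes analyzed in the paper.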

Paper Structure

This paper contains 22 sections, 26 theorems, 151 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 4.2

Under Assumptions assu:basic (with $\mu=0$) and assu:lip, suppose the $\textsf{RR}$ sampling scheme is employed with one of the following three stepsizes $\eta_{t},\forall t\in\left[T\right]$: [...] Then Algorithm alg:Alg guarantees [...] If additionally assuming $T=\Omega(n\log n)$, then the first stepsize choice achieves the following improved rate [...]

Figures (1)

  • Figure 1: Summary of our new convergence rates and the best-known upper/lower bounds under different settings when $T=Kn$, where $K\in\mathbb{N}$. All results use the function value gap as the convergence measure. In the Shuffling column, $\textsf{ANY}$ means the rate in the same row holds for any shuffling scheme, not only $\textsf{RR}/\textsf{SS}/\textsf{IG}$. In the Rate column, $D_{\star}\triangleq\left\Vert \boldsymbol{x}_{\star}-\boldsymbol{x}_{1}\right\Vert$ denotes the Euclidean distance (or any upper bound on it) between the optimal solution $\boldsymbol{x}_{\star}$ and the initial point $\boldsymbol{x}_{1}$. $\land$ and $\lor$ indicate the $\min$ and $\max$ operations, respectively. In the last column, $\boldsymbol{x}_{Kn+1}^{\textsf{avg}}\triangleq\frac{1}{Kn}\sum_{t=1}^{Kn}\boldsymbol{x}_{t+1}$ and $\boldsymbol{x}_{Kn+1}^{\textsf{suffix}}\triangleq\frac{1}{n}\sum_{t=Kn-n+1}^{Kn}\boldsymbol{x}_{t+1}$ refer to the average iterate and the suffix average over the last epoch, respectively.
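
The two output points defined in the caption can be computed directly from a stored trajectory. The sketch below assumes scalar iterates kept in a Python list `iterates = [x_1, ..., x_{Kn+1}]`; the function names are hypothetical and chosen only for this illustration.

```python
def average_iterate(iterates):
    """x^avg_{Kn+1} = (1/(Kn)) * sum_{t=1}^{Kn} x_{t+1}.

    iterates[0] is x_1, so the sum runs over everything after it.
    """
    tail = iterates[1:]
    if not tail:
        raise ValueError("need at least one update after x_1")
    return sum(tail) / len(tail)


def suffix_average(iterates, n):
    """x^suffix_{Kn+1} = (1/n) * sum_{t=Kn-n+1}^{Kn} x_{t+1}.

    This is the mean of the last n iterates, i.e. those produced
    during the final epoch.
    """
    if n <= 0 or len(iterates) <= n:
        raise ValueError("need at least one full epoch of updates")
    last_epoch = iterates[-n:]
    return sum(last_epoch) / n
```

For example, with $n=2$ and $K=2$, the trajectory `[0.0, 1.0, 2.0, 3.0, 4.0]` has average iterate 2.5 (the mean of $x_2,\dots,x_5$) and suffix average 3.5 (the mean of $x_4$ and $x_5$).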

Theorems & Definitions (53)

  • Remark 3.1
  • Example 3.2
  • Example 3.3
  • Example 3.4
  • Example 3.5
  • Remark 4.1
  • Theorem 4.2
  • Corollary 4.3
  • proof
  • Theorem 4.4
  • ...and 43 more