On the Last-Iterate Convergence of Shuffling Gradient Methods

Zijian Liu, Zhengyuan Zhou

TL;DR

This work proves the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value, even without strong convexity.

Abstract

Shuffling gradient methods are widely used in modern machine learning tasks and include three popular implementations: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). Compared to the empirical success, the theoretical guarantee of shuffling gradient methods was not well-understood for a long time. Until recently, the convergence rates had just been established for the average iterate for convex functions and the last iterate for strongly convex problems (using squared distance as the metric). However, when using the function value gap as the convergence criterion, existing theories cannot interpret the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.
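
To make the three schemes named in the abstract concrete, here is a minimal Python sketch of how RR, SO, and IG differ only in how the per-epoch visiting order is chosen. It is an illustration under simplifying assumptions, not the paper's algorithm: the function name `shuffling_gd`, the scalar iterate, the constant step size, and the toy quadratic components are all hypothetical, and any regularizer or constraint term is omitted.

```python
import random


def shuffling_gd(grads, x0, step, epochs, scheme="RR", seed=0):
    """Run a shuffling gradient method and return the last iterate.

    grads  : list of callables; grads[i](x) is the derivative of f_i at x
    scheme : "RR" draws a new permutation every epoch,
             "SO" draws one permutation and reuses it,
             "IG" always visits components in the order 0, 1, ..., n-1
    """
    if scheme not in ("RR", "SO", "IG"):
        raise ValueError("scheme must be 'RR', 'SO', or 'IG'")
    rng = random.Random(seed)
    order = list(range(len(grads)))
    if scheme == "SO":
        rng.shuffle(order)          # shuffle once, before the first epoch
    x = x0
    for _ in range(epochs):
        if scheme == "RR":
            rng.shuffle(order)      # reshuffle at the start of every epoch
        for i in order:             # one pass over all components
            x = x - step * grads[i](x)
    return x


# Toy problem (an assumption for illustration): f_i(x) = 0.5 * (x - a_i)^2,
# whose average is minimized at the mean of the a_i.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
component_grads = [lambda x, a=a: x - a for a in targets]

for name in ("RR", "SO", "IG"):
    x_last = shuffling_gd(component_grads, x0=0.0, step=0.01, epochs=200, scheme=name)
    print(name, x_last)
```

On this toy problem all three schemes should end close to the mean of the targets, with a small offset that depends on the step size and the visiting order.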

Paper Structure (20 sections, 20 theorems, 158 equations, 2 figures, 3 tables, 1 algorithm)

Key Result

Lemma 3.6

Given a convex and differentiable function $g(\mathbf{x}):\mathbb{R}^{d}\to\mathbb{R}$ satisfying $\left\Vert \nabla g(\mathbf{x})-\nabla g(\mathbf{y})\right\Vert \leq L\left\Vert \mathbf{x}-\mathbf{y}\right\Vert$, $\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$ for some $L>0$, then ...

Figures (2)

  • Figure: Summary of our new upper bounds and the existing lower bounds for $L$-smooth $f_{i}(\mathbf{x})$ for large $K$. Where no lower bound has been established for a case, we instead state the previous best-known rate. Here, $\sigma_{\mathrm{any}}^{2}\triangleq\frac{1}{n}\sum_{i=1}^{n}\left\Vert \nabla f_{i}(\mathbf{x}_{*})\right\Vert ^{2}$, $\sigma_{\mathrm{rand}}^{2}\triangleq\sigma_{\mathrm{any}}^{2}+n\left\Vert \nabla f(\mathbf{x}_{*})\right\Vert ^{2}$ and $D\triangleq\left\Vert \mathbf{x}_{*}-\mathbf{x}_{1}\right\Vert$. All rates use the function value gap as the convergence criterion. In the Type column, Any means the rate holds for any permutation, not only RR/SO/IG. Random refers to a uniformly sampled permutation but is not restricted to RR/SO (see Remark \ref{rem:sample} for a detailed explanation). Avg and Last in the Output column stand for the average iterate and the last iterate, respectively. In the last column, ✓ means $\psi(\mathbf{x})$ can be taken arbitrarily and ✗ means $\psi(\mathbf{x})=0$.
  • Figure: Summary of our new upper bounds and the previous fastest rates for $G$-Lipschitz $f_{i}(\mathbf{x})$ for large $K$. As far as we know, no lower bound has been proved in this case. Here, $D\triangleq\left\Vert \mathbf{x}_{*}-\mathbf{x}_{1}\right\Vert$. All rates use the function value gap as the convergence criterion. In the Type column, Any means the rate holds for any permutation, not only RR/SO/IG. Avg and Last in the Output column stand for the average iterate and the last iterate, respectively. In the last column, ✓ means $\psi(\mathbf{x})$ can be taken arbitrarily and ✗ means $\psi(\mathbf{x})=0$.
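
The quantities $\sigma_{\mathrm{any}}^{2}$ and $\sigma_{\mathrm{rand}}^{2}$ defined in the first caption can be evaluated directly once the component gradients at $\mathbf{x}_{*}$ are known. The sketch below is only illustrative: the function names are hypothetical, gradients are passed as plain lists of floats, and it assumes $f$ is the average of the $f_{i}$, which the captions do not state explicitly.

```python
def sigma_any_sq(grads_at_opt):
    """(1/n) * sum_i ||grad f_i(x_*)||^2, from a list of gradient vectors."""
    n = len(grads_at_opt)
    return sum(sum(c * c for c in g) for g in grads_at_opt) / n


def sigma_rand_sq(grads_at_opt):
    """sigma_any^2 + n * ||grad f(x_*)||^2.

    Assumption: f is the average of the f_i, so grad f(x_*) is the
    coordinate-wise mean of the component gradients.
    """
    n = len(grads_at_opt)
    dim = len(grads_at_opt[0])
    mean_grad = [sum(g[k] for g in grads_at_opt) / n for k in range(dim)]
    return sigma_any_sq(grads_at_opt) + n * sum(c * c for c in mean_grad)


# Case 1: component gradients cancel at x_*, so the two quantities coincide.
balanced = [[1.0], [0.0], [-1.0]]
print(sigma_any_sq(balanced), sigma_rand_sq(balanced))

# Case 2: the averaged gradient at x_* is nonzero, so sigma_rand^2 is larger.
shifted = [[1.0], [1.0]]
print(sigma_any_sq(shifted), sigma_rand_sq(shifted))
```

When the averaged gradient at $\mathbf{x}_{*}$ vanishes, the extra term is zero and the two quantities agree; otherwise $\sigma_{\mathrm{rand}}^{2}$ exceeds $\sigma_{\mathrm{any}}^{2}$ by $n$ times the squared norm of the averaged gradient.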

Theorems & Definitions (39)

  • Lemma 3.6
  • Example 4.1
  • Example 4.2
  • Example 4.3
  • Theorem 4.4
  • Remark 4.5
  • Theorem 4.6
  • Theorem 4.7
  • Remark 4.8
  • Theorem 4.9
  • ...and 29 more