Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes

Marcel Hudiani

TL;DR

This work establishes almost-sure and high-probability convergence rates for the last iterate of SGD and SHB when the objective $F$ is globally convex or non-convex and $\nabla F$ is $(\gamma, L)$-Hölder. It develops a Gronwall-based analysis that avoids the Robbins–Siegmund theorem, deriving rates such as $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $\min_{s \leq t} \bigl(F(w_s) - F_*\bigr) = o(t^{p-1})$ for convex objectives, with $p \in (1/(1+\gamma), 1)$. For SHB with momentum $\beta \in (0,1)$, the rate $F(w_{\tau \wedge t}) - F_* = o\bigl(t^{\frac{2\gamma}{1+\gamma}\max(p-1,\,1-(1+\gamma)p) - \epsilon}\bigr)$ is obtained, and a high-probability convex-case rate $F(w_{T+1}) - F_* = O\bigl(T^{\max(p-1,\,-2p+1)} \log^2(T/\delta)\bigr)$ is shown. The analysis relies on a non-martingale Gronwall framework and an ABC condition to control gradient noise, yielding a unified treatment that covers both smoothness regimes and both convex and non-convex objectives with global guarantees. The results serve as last-iterate convergence benchmarks in this setting, obtained without invoking stochastic approximation theorems.
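
To make the high-probability exponent concrete, the short sketch below evaluates $\max(p-1, -2p+1)$ on a grid of step-size exponents $p \in (1/2, 1)$ and finds where it is most negative. The function name `rate_exponent` and the grid resolution are illustrative choices, not taken from the paper, and the logarithmic factor is ignored.

```python
from fractions import Fraction

def rate_exponent(p):
    # Exponent of T in the bound O(T^{max(p-1, -2p+1)} log^2(T/delta)).
    # Hypothetical helper name; the log factor is not modelled here.
    return max(p - 1, 1 - 2 * p)

# Exact rational grid strictly inside (1/2, 1); the resolution is an arbitrary choice.
grid = [Fraction(k, 300) for k in range(151, 300)]

# The two branches p - 1 and 1 - 2p cross where p - 1 = 1 - 2p, i.e. p = 2/3.
best_p = min(grid, key=rate_exponent)

print("p minimising the exponent on the grid:", best_p)
print("exponent at that p:", rate_exponent(best_p))
```

On this grid the two branches balance at $p = 2/3$, where the exponent equals $-1/3$; for smaller $p$ the $-2p+1$ branch dominates, and for larger $p$ the $p-1$ branch does.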

Abstract

We study the convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex and its gradient is $\gamma$-Hölder. Using only the discrete Gronwall inequality, without the Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives, and $F(w_{\tau \wedge t}) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,\,-2p+1) - \epsilon})$ for $\beta \in (0, 1)$ and $\tau := \inf \{ t > 0 : F(w_t) = F_* \}$, as well as $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$, for convex objectives $F$ whose minimum is $F_*$. In addition, we prove that SHB with constant momentum parameter $\beta \in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,\,-2p+1)} \log^2 \frac{t}{\delta})$ with probability at least $1 - \delta$ when $F$ is convex, $\gamma = 1$, and the step size is $\alpha_t = \Theta(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
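
As a rough illustration of the two schemes, the sketch below runs SGD and SHB on a toy one-dimensional convex objective with step size $\alpha_t = c\,t^{-p}$. Everything here is an assumption made for illustration: the objective $F(w) = \tfrac{1}{2}w^2$ (so $F_* = 0$ and $\gamma = 1$), the additive Gaussian gradient noise, the constants `c`, `beta`, and `noise_std`, and the common heavy-ball form $w_{t+1} = w_t - \alpha_t g_t + \beta (w_t - w_{t-1})$, which may differ from the exact parametrisation used in the paper.

```python
import random

def F(w):
    # Toy convex objective (assumption): F(w) = 0.5 * w^2, minimum F_* = 0.
    return 0.5 * w * w

def grad_F(w):
    # Gradient of the toy objective; it is 1-Lipschitz, i.e. gamma = 1.
    return w

def run_shb(T, p=2 / 3, beta=0.5, c=0.5, noise_std=0.1, w0=5.0, seed=0):
    """Return the last iterate after T steps of stochastic heavy ball.

    Assumed update: w_{t+1} = w_t - alpha_t * g_t + beta * (w_t - w_{t-1}),
    with alpha_t = c * t^{-p}. Setting beta = 0 gives plain SGD.
    """
    rng = random.Random(seed)
    w_prev, w = w0, w0
    for t in range(1, T + 1):
        alpha = c * t ** (-p)                       # decaying step size
        g = grad_F(w) + rng.gauss(0.0, noise_std)   # noisy gradient sample
        # The right-hand side is evaluated with the old w and w_prev.
        w, w_prev = w - alpha * g + beta * (w - w_prev), w
    return w

w_sgd = run_shb(T=20000, beta=0.0)   # SGD as the beta = 0 special case
w_shb = run_shb(T=20000, beta=0.5)   # SHB with constant momentum

print("F(last iterate), SGD:", F(w_sgd))
print("F(last iterate), SHB:", F(w_shb))
```

A single run like this only shows the last iterate moving close to the minimum; it does not verify the stated rates, which would need many horizons $T$ and repeated trials.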

Paper Structure

This paper contains 18 sections, 28 theorems, 132 equations, 2 tables.

Key Result

Proposition 2.4

Let $X_t, Y_t, Z_t$ be non-negative for all $t \in \mathbb{N}_0$ and $a_t > 0$ be such that [...]. Then $Y_t \in \ell^{\infty}(\mathbb{N})$ and $X_t \in \ell^1(\mathbb{N})$.

Theorems & Definitions (60)

  • Proposition 2.4
  • Lemma 2.5
  • Remark 2.6
  • Theorem 2.7
  • Remark 2.8
  • Remark 2.9
  • Theorem 2.10
  • Remark 2.11
  • Proposition 3.1
  • proof
  • ...and 50 more