Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes

Marcel Hudiani

TL;DR

This work establishes almost-sure and high-probability convergence rates for the last iterate of SGD and SHB when the objective $F$ is globally convex or non-convex and $\nabla F$ is $(\gamma, L)$-Hölder. It develops a Gronwall-based analysis that avoids the Robbins–Siegmund theorem, deriving rates such as $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $\min_{s \leq t} \bigl(F(w_s) - F_*\bigr) = o(t^{p-1})$ for convex objectives, with $p \in (1/(1+\gamma), 1)$. For SHB with momentum $\beta \in (0, 1)$, the rate $F(w_{\tau \wedge t}) - F_* = o\bigl(t^{\frac{2\gamma}{1+\gamma} \max(p-1,\, 1-(1+\gamma)p) - \epsilon}\bigr)$ is obtained, and a high-probability convex-case rate $F(w_{T+1}) - F_* = O\bigl(T^{\max(p-1,\, -2p+1)} \log^2(T/\delta)\bigr)$ is shown. The analysis relies on a non-martingale Gronwall framework and an ABC condition to control gradient noise, yielding a unified treatment that covers both smoothness regimes and both convex and non-convex objectives with global guarantees. These results give last-iterate convergence rates in this setting without invoking stochastic approximation theorems.

Abstract

We study the convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex and its gradient is $\gamma$-Hölder. Using only the discrete Gronwall inequality, without the Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives, $F(w_{\tau \wedge t}) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,\, -2p+1) - \epsilon})$ for $\beta \in (0, 1)$, where $\tau := \inf \{ t > 0 : F(w_t) = F_* \}$, and $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$ for convex objectives $F$ whose minimum is $F_*$. In addition, we prove that SHB with constant momentum parameter $\beta \in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,\, -2p+1)} \log^2 \frac{t}{\delta})$ with probability at least $1 - \delta$ when $F$ is convex, $\gamma = 1$, and the step size is $\alpha_t = \Theta(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
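To make the setting concrete, the following is a minimal sketch of a heavy-ball iteration with step size $\alpha_t = t^{-p}$ and constant momentum $\beta$. It is an illustration under stated assumptions, not the scheme as defined in the paper: the update form $w_{t+1} = w_t - \alpha_t g_t + \beta (w_t - w_{t-1})$, the test objective $F(w) = \frac{1}{2} w^2$ (convex, $\gamma = 1$, $F_* = 0$), the Gaussian gradient noise, the starting point, and the function names `shb_last_iterate_gap` and `rate_exponent` are all hypothetical choices made here. Setting `beta=0.0` reduces the loop to plain SGD.

```python
import random


def rate_exponent(p):
    """Exponent max(p - 1, -2p + 1) from the high-probability rate (gamma = 1)."""
    return max(p - 1.0, 1.0 - 2.0 * p)


def shb_last_iterate_gap(steps=20000, p=0.75, beta=0.5, noise_std=1.0, seed=0):
    """Run heavy ball on F(w) = 0.5 * w**2 and return F(w_T) - F_* for the last iterate.

    Assumed update (not taken from the paper):
        w_{t+1} = w_t - alpha_t * g_t + beta * (w_t - w_{t-1}),  alpha_t = t**(-p).
    """
    rng = random.Random(seed)
    w_prev = w = 5.0  # arbitrary starting point (assumption)
    for t in range(1, steps + 1):
        alpha = t ** (-p)                      # polynomially decaying step size
        grad = w + rng.gauss(0.0, noise_std)   # noisy gradient of 0.5 * w**2
        w_next = w - alpha * grad + beta * (w - w_prev)
        w_prev, w = w, w_next
    return 0.5 * w * w                         # F_* = 0 for this objective


if __name__ == "__main__":
    print("exponent at p = 0.75:", rate_exponent(0.75))
    print("SHB last-iterate gap:", shb_last_iterate_gap())
    print("SGD last-iterate gap:", shb_last_iterate_gap(beta=0.0))
```

For $p \in (\frac{1}{2}, 1)$ the two terms inside the maximum cross at $p = \frac{2}{3}$, where the exponent equals $-\frac{1}{3}$; a single simulated run like the one above only illustrates the iteration and does not verify the stated rate.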
