Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes
Marcel Hudiani
TL;DR
This work establishes almost-sure and high-probability convergence rates for the last iterate of SGD and SHB under global convexity or non-convexity with $\nabla F$ being $(\gamma, L)$-Hölder. It develops a Gronwall-based analysis that avoids the Robbins–Siegmund theorem, deriving rates like $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $\min_{s \leq t} \bigl(F(w_s) - F_*\bigr) = o(t^{p-1})$ for convex objectives, with $p \in (1/(1+\gamma), 1)$. For SHB with momentum $\beta \in (0,1)$, a rate of $F(w_{\tau \wedge t}) - F_* = o\bigl(t^{\frac{2\gamma}{1+\gamma} \max(p-1,\, 1-(1+\gamma)p) - \epsilon}\bigr)$ appears, and a high-probability convex-case rate $F(w_{T+1}) - F_* = O\bigl(T^{\max(p-1,\, -2p+1)} \log^2(T/\delta)\bigr)$ is shown. The analysis relies on a non-martingale Gronwall framework and an ABC condition to control gradient noise, yielding a unified treatment that covers both smoothness regimes and convex vs non-convex objectives with global guarantees. These results provide the best-available last-iterate convergence benchmarks in this setting without invoking stochastic approximation theorems.
Abstract
We study the convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex and its gradient is $\gamma$-Hölder. Using only the discrete Gronwall inequality, without the Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives, and, for convex objectives $F$ whose minimum is $F_*$, $F(w_{\tau \wedge t}) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,-2p+1) - \epsilon})$ for $\beta \in (0, 1)$ with $\tau := \inf \{ t > 0 : F(w_t) = F_* \}$, and $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$. In addition, we prove that SHB with constant momentum parameter $\beta \in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2 \frac{t}{\delta})$ with probability at least $1-\delta$ when $F$ is convex, $\gamma = 1$, and the step size is $\alpha_t = \Theta(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
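To make the setting concrete, the following is a minimal Python sketch of the iteration being analyzed, under several assumptions that are not taken from the paper: the heavy-ball update is written in the common form $w_{t+1} = w_t - \alpha_t g_t + \beta (w_t - w_{t-1})$ (the paper's exact parametrization may differ), the step size is $\alpha_t = t^{-p}$, the gradient noise is additive Gaussian, and the test objective is the convex quadratic $F(w) = \frac{1}{2}\|w\|^2$ with $F_* = 0$. The function name `shb` and all of its parameters are hypothetical. Setting `beta=0.0` reduces the update to plain SGD.

```python
import numpy as np

def shb(grad, w0, steps, p=0.75, beta=0.5, noise_std=0.1, seed=0):
    """Stochastic heavy ball with step size alpha_t = t**(-p).

    Illustrative sketch only: the update form and the additive Gaussian
    noise model are assumptions, not the paper's exact scheme.
    """
    rng = np.random.default_rng(seed)
    w_prev = np.asarray(w0, dtype=float)
    w = w_prev.copy()
    for t in range(1, steps + 1):
        alpha = t ** (-p)  # polynomially decaying step size, p in (1/2, 1)
        # Noisy gradient oracle: true gradient plus zero-mean noise.
        g = grad(w) + noise_std * rng.standard_normal(w.shape)
        # Gradient step plus momentum term beta * (w_t - w_{t-1}).
        w_next = w - alpha * g + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w  # last iterate

def F(w):
    # Assumed test objective: F(w) = 0.5 * ||w||^2, minimized at 0 with F_* = 0.
    return 0.5 * float(np.dot(w, w))

w0 = np.array([5.0, -3.0])
w_sgd = shb(lambda w: w, w0, steps=5000, beta=0.0)  # beta = 0 gives SGD
w_shb = shb(lambda w: w, w0, steps=5000, beta=0.5)  # constant momentum
print("SGD gap:", F(w_sgd), "SHB gap:", F(w_shb))
```

The printed values are the optimality gaps $F(w_t) - F_*$ of the last iterate. A single run like this only illustrates the quantity the rates describe and does not verify the rates themselves.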
