Stochastic Gradient Descent Revisited

Azar Louzi

TL;DR

This work revisits stochastic gradient descent for biased nonconvex stochastic optimization, establishing a comprehensive convergence theory under mild Hölder-smoothness and moment conditions. It proves almost-sure weak, function-value, and global convergence without assuming iterate boundedness, and derives a suite of convergence rates and iteration complexities under Łojasiewicz-type conditions, including high-probability bounds that depend on the exponent $\beta$, Hölder parameter $\alpha$, and learning-rate schedule $\gamma_n$. A novel drift–martingale decomposition together with a final-hitting-time concept enables precise rates across regimes, offering guidance on learning-rate design and implying robustness of SGD in nonconvex settings. These results extend the theoretical understanding of SGD beyond convex regimes and provide practical insights for tuning learning rates in deep learning-like nonconvex landscapes.

Abstract

Stochastic gradient descent (SGD) has been a go-to algorithm for nonconvex stochastic optimization problems arising in machine learning. Its theory, however, often requires a strong framework to guarantee convergence properties. We present a full-scope convergence study of biased nonconvex SGD, covering weak convergence, function-value convergence, and global convergence, and we also provide the corresponding convergence rates and complexities, all under conditions that are mild relative to the literature.

Paper Structure

This paper contains 9 sections, 23 theorems, 209 equations.

Key Result

Theorem 1.5

Under $\mathcal{H}$asp:f--asp:gamma, the function values $(F(\theta_n))_{n\geq0}$ converge $\mathbb{P}$-a.s. to a real-valued random variable $F_\star\in L^1(\mathbb{P})$, and $\nabla F(\theta_n)\to0$ $\mathbb{P}$-a.s. as $n\to\infty$.
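
The behaviour described in Theorem 1.5 can be illustrated numerically on a toy problem. The sketch below is not the paper's setting: the one-dimensional objective `F`, the Gaussian noise, the vanishing bias `1/n`, and the schedule `gamma_n = gamma0 / n**power` are all assumptions chosen for illustration, and the helper `biased_sgd` is hypothetical. The schedule follows the classical Robbins-Monro pattern (step sizes sum to infinity, their squares are summable), which may differ from the paper's own conditions on $\gamma_n$.

```python
import math
import random


def F(theta):
    """Toy nonconvex objective (an assumption, not taken from the paper)."""
    return theta**2 + 3.0 * math.sin(theta) ** 2


def grad_F(theta):
    """Exact gradient of F: d/dtheta [theta^2 + 3 sin^2(theta)]."""
    return 2.0 * theta + 3.0 * math.sin(2.0 * theta)


def biased_sgd(theta0=3.0, n_steps=20_000, gamma0=0.1, power=0.75,
               noise_std=1.0, seed=0):
    """Run SGD with a biased, noisy gradient oracle (illustrative only).

    Returns the final iterate and the list of |grad F(theta_n)| values.
    """
    rng = random.Random(seed)
    theta = theta0
    grad_norms = []
    for n in range(1, n_steps + 1):
        gamma_n = gamma0 / n**power   # decaying learning rate (assumed form)
        bias_n = 1.0 / n              # vanishing oracle bias (assumed form)
        noisy_grad = grad_F(theta) + bias_n + rng.gauss(0.0, noise_std)
        theta -= gamma_n * noisy_grad
        grad_norms.append(abs(grad_F(theta)))
    return theta, grad_norms


if __name__ == "__main__":
    theta, grad_norms = biased_sgd()
    print("final iterate:", theta)
    print("final F value:", F(theta))
    print("final |grad F|:", grad_norms[-1])
```

On this example the true gradient norm along the trajectory becomes small as $n$ grows, which is consistent with the statement $\nabla F(\theta_n)\to0$. A single seeded run is only a sanity check, not evidence of almost-sure convergence.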

Theorems & Definitions (61)

  • Example 1.1
  • Example 1.2
  • Remark 1.3
  • Example 1.4
  • Theorem 1.5
  • Lemma 1.6
  • Theorem 1.7
  • Remark 1.8
  • Proposition 1.9: Loj65
  • Proposition 1.10: Kur98
  • ...and 51 more