Stochastic Gradient Descent Revisited
Azar Louzi
TL;DR
This work revisits stochastic gradient descent (SGD) for biased nonconvex stochastic optimization and establishes a comprehensive convergence theory under mild Hölder-smoothness and moment conditions. It proves almost-sure weak convergence, function-value convergence, and global convergence without assuming that the iterates are bounded. Under Łojasiewicz-type conditions, it derives convergence rates and iteration complexities, including high-probability bounds, that depend on the Łojasiewicz exponent $\beta$, the Hölder parameter $\alpha$, and the learning-rate schedule $\gamma_n$. A novel drift–martingale decomposition, combined with a final-hitting-time concept, yields precise rates across regimes and offers guidance on learning-rate design. These results extend the theoretical understanding of SGD beyond convex regimes and provide practical insight for tuning learning rates on the nonconvex landscapes typical of deep learning.
Abstract
Stochastic gradient descent (SGD) has been a go-to algorithm for nonconvex stochastic optimization problems arising in machine learning. Its theory, however, often requires strong assumptions to guarantee convergence. We present a full-scope convergence study of biased nonconvex SGD, covering weak convergence, function-value convergence, and global convergence, and we provide the corresponding convergence rates and complexities, all under conditions that are relatively mild compared with the literature.
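For orientation, the kind of recursion studied here can be illustrated with a minimal sketch of biased SGD on a toy nonconvex objective. Everything specific in it is an assumption made for illustration and is not taken from the paper: the double-well objective, the polynomial schedule $\gamma_n = \gamma_0 / n^{\rho}$, the Gaussian noise, the vanishing bias model, and the names `biased_sgd`, `gamma0`, `rho`, `bias_scale` and `noise_std`.

```python
import numpy as np


def biased_sgd(grad, theta0, n_steps, gamma0=0.05, rho=0.75,
               bias_scale=0.1, noise_std=0.1, seed=0):
    """Run SGD with a biased, noisy gradient oracle (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for n in range(1, n_steps + 1):
        gamma_n = gamma0 / n**rho  # assumed decaying learning-rate schedule
        # Hypothetical bias model: a deterministic offset that vanishes with gamma_n.
        bias = bias_scale * gamma_n * np.ones_like(theta)
        # Zero-mean Gaussian noise standing in for the martingale increment.
        noise = noise_std * rng.standard_normal(theta.shape)
        theta = theta - gamma_n * (grad(theta) + bias + noise)
    return theta


def f(theta):
    """Toy nonconvex objective: a separable double well with minima at +/-1."""
    return float(np.sum((theta**2 - 1.0) ** 2))


def grad_f(theta):
    """Exact gradient of the double-well objective."""
    return 4.0 * theta * (theta**2 - 1.0)


theta_final = biased_sgd(grad_f, theta0=[0.5, -1.5], n_steps=20000)
print(theta_final, f(theta_final))
```

In this toy run each coordinate settles near one of the two minimizers at $\pm 1$. The sketch only illustrates the roles of the schedule $\gamma_n$, the bias, and the noise; it does not reproduce the paper's assumptions or rates.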
