High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Liam Madden; Emiliano Dall'Anese; Stephen Becker

TL;DR

This work analyzes high-probability convergence of stochastic gradient descent (SGD) without assuming convexity. Assuming strong smoothness, it treats two settings: (i) objectives satisfying the Polyak-Łojasiewicz (PL) inequality with norm sub-Gaussian gradient noise and (ii) general non-convex objectives with norm sub-Weibull gradient noise. The second setting relies on a new self-normalized concentration inequality for sub-Weibull martingale difference sequences. A post-processing scheme selects a single iterate with a provable convergence guarantee, rather than bounding only the unknown best iterate. The sub-Weibull result extends the regime in which plain SGD has convergence guarantees equal to or better than those of modified variants such as clipping, momentum, and normalization. A neural network example illustrates how heavier gradient-noise tails translate into heavier tails in the convergence error, consistent with the theory. Overall, the paper extends high-probability SGD guarantees to heavier-tailed noise regimes and provides new probabilistic tools for analyzing stochastic non-convex optimization.

Abstract

Stochastic gradient descent is one of the most common iterative algorithms used in machine learning, and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform which modifications to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee, as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.
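
To make the setting concrete, the following is a minimal Python sketch of plain SGD on a smooth non-convex function with heavy-tailed gradient noise, followed by a simple rule for choosing one iterate. Everything in it is an illustrative assumption and not taken from the paper: the test function, the noise construction (a Gaussian raised to the power 2*theta, which yields tails of sub-Weibull type with tail parameter theta, where theta = 1/2 is the Gaussian case), the step size, and the helper names sub_weibull_noise, grad_f, sgd, and pick_iterate. In particular, pick_iterate is a generic selection heuristic based on re-sampled gradient estimates and may differ from the post-processing method the paper analyzes.

```python
import numpy as np


def sub_weibull_noise(rng, dim, theta, scale=1.0):
    """Symmetric noise with sub-Weibull-type tails (assumed construction).

    If g is standard Gaussian, sign(g) * |g|**(2*theta) has tails that
    decay like exp(-t**(1/theta) / 2); theta = 0.5 recovers g itself.
    """
    g = rng.standard_normal(dim)
    return scale * np.sign(g) * np.abs(g) ** (2.0 * theta)


def grad_f(x):
    """Gradient of the smooth non-convex f(x) = sum(x**2 / (1 + x**2))."""
    return 2.0 * x / (1.0 + x ** 2) ** 2


def sgd(x0, steps, step_size, theta, rng):
    """Plain SGD (no clipping, momentum, or normalization); returns all iterates."""
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(steps):
        noisy_grad = grad_f(x) + sub_weibull_noise(rng, x.size, theta)
        x = x - step_size * noisy_grad
        iterates.append(x.copy())
    return iterates


def pick_iterate(iterates, num_candidates, num_samples, theta, rng):
    """Hypothetical post-processing: among randomly chosen candidate iterates,
    return the one whose averaged fresh stochastic gradient has smallest norm."""
    candidate_idx = rng.choice(len(iterates), size=num_candidates, replace=False)
    best_x, best_norm = None, np.inf
    for i in candidate_idx:
        x = iterates[i]
        samples = [grad_f(x) + sub_weibull_noise(rng, x.size, theta)
                   for _ in range(num_samples)]
        est_norm = float(np.linalg.norm(np.mean(samples, axis=0)))
        if est_norm < best_norm:
            best_x, best_norm = x, est_norm
    return best_x, best_norm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for theta in (0.5, 1.0, 2.0):  # larger theta means heavier tails
        its = sgd(np.full(5, 2.0), steps=2000, step_size=0.02, theta=theta, rng=rng)
        x_hat, est = pick_iterate(its, num_candidates=20, num_samples=200,
                                  theta=theta, rng=rng)
        true_norm = np.linalg.norm(grad_f(x_hat))
        print(f"theta={theta}: estimated grad norm {est:.3f}, true {true_norm:.3f}")
```

Repeating the run over many seeds and comparing the spread of the true gradient norm across values of theta gives a rough empirical picture of how heavier noise tails show up in the error of the selected iterate.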
