High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Liam Madden, Emiliano Dall'Anese, Stephen Becker

TL;DR

This work analyzes high-probability convergence of stochastic gradient descent (SGD) without assuming convexity. Assuming strong smoothness, it covers two settings: (i) objectives satisfying the Polyak-Łojasiewicz (PŁ) inequality with norm sub-Gaussian gradient noise and (ii) general non-convex objectives with norm sub-Weibull gradient noise. The second setting is supported by a new self-normalized concentration inequality for sub-Weibull martingale difference sequences. A post-processing scheme selects a single iterate with a provable convergence guarantee, and the sub-Weibull result extends the regime in which plain SGD has guarantees equal to or better than those of modified variants such as clipping, momentum, and normalization. A neural network example illustrates how heavier gradient-noise tails translate into heavier tails in the convergence error, consistent with the theory. Overall, the paper extends high-probability SGD guarantees to heavier-tailed noise regimes and provides new probabilistic tools for analyzing stochastic non-convex optimization.

Abstract

Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.
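
For orientation, the tail condition named in setting (2) can be written out explicitly. The block below is a sketch of one common formulation from the sub-Weibull literature; the symbols $K$ and $\theta$ and the normalizing constant $2$ are assumptions here and may differ from the paper's own definitions. A random vector is then "norm sub-Weibull" when its Euclidean norm satisfies the condition.

```latex
% X is K-sub-Weibull(\theta), for K > 0 and \theta > 0, if
\mathbb{E}\left[\exp\left(\left(|X|/K\right)^{1/\theta}\right)\right] \le 2 .
% Applying Markov's inequality to \exp((|X|/K)^{1/\theta}) gives the tail bound
\mathbb{P}\left(|X| \ge t\right) \le 2\exp\left(-\left(t/K\right)^{1/\theta}\right)
\quad \text{for all } t \ge 0 .
% \theta = 1/2 corresponds to sub-Gaussian tails and \theta = 1 to
% sub-exponential tails; larger \theta allows heavier tails.
```

Under this parametrization, the "sub-exponential threshold" mentioned above corresponds to $\theta = 1$, and the heavier-tailed regime to $\theta > 1$.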

Paper Structure

This paper contains 15 sections, 19 theorems, 117 equations, 3 figures, 1 table.

Key Result

Lemma 1

Let $m,n,d\in\mathbb N$ and $a\in\mathbb R$. Let $\phi:\mathbb R\to\mathbb R$ be twice differentiable and assume $|\phi(x)|,|\phi'(x)|,|\phi''(x)|\le a$ for all $x\in\mathbb R$. Let $X\in\mathbb R^{d\times n}$, $v\in\mathbb R^m$, and $y\in\mathbb R^n$. Define Then $f$ is Lipschitz continuous and strongly smooth.

Figures (3)

  • Figure 1: Empirical $1-\delta$ convergence error, averaged over 10000 runs. The dashed lines show the mean $\pm$ one standard deviation (computed over 5 blocks of 2000 runs each). The data are less reliable for small $\delta$.
  • Figure 2: Same data as Fig. 1, but each line series is normalized and the x-axis is $\log(1/\delta)$, plotted on a logarithmic scale, so a $\log(1/\delta)^a$ dependence shows up as a straight line with slope $a$. The $\delta$ range is from $0.2$ (left side) to $0.01$ (right side), since any smaller $\delta$ has unreliable statistics. Lines of best fit using the exponents from Table 1 are shown (with arbitrary shifts for clarity).
  • Figure 3: Contour plot of the PŁ function counter-example to projected gradient flow
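
The quantity plotted in Figure 1, an empirical $1-\delta$ convergence error, could be estimated along the following lines. This is a minimal sketch and not the paper's experiment: the toy objective $f(x)=x^2+3\sin^2(x)$ (a standard non-convex function satisfying the PŁ inequality), the step-size schedule, the symmetrized Weibull noise, and all function names are assumptions chosen for illustration, whereas the paper uses a neural network example.

```python
import math
import random


def f(x):
    # Toy objective (assumed for illustration): non-convex, minimum value 0 at x = 0.
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    # Derivative of f; |f''| <= 8, so f is strongly smooth with L = 8.
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def weibull_noise(theta, rng):
    # Symmetrized Weibull draw with shape 1/theta, so P(|noise| >= t) = exp(-t^(1/theta)).
    magnitude = rng.weibullvariate(1.0, 1.0 / theta)
    return magnitude if rng.random() < 0.5 else -magnitude


def final_errors(theta, runs=500, steps=100, seed=0):
    # Run plain SGD `runs` times and return the sorted final errors f(x_T) - f*.
    rng = random.Random(seed)
    errors = []
    for _ in range(runs):
        x = 2.0
        for t in range(1, steps + 1):
            eta = 1.0 / (8.0 + t)  # decaying step size, always below 1/L
            x -= eta * (grad(x) + weibull_noise(theta, rng))
        errors.append(f(x))  # f* = 0, so f(x) is the error
    return sorted(errors)


def empirical_quantile(sorted_errors, delta):
    # Smallest observed error e such that at least a (1 - delta) fraction of runs have error <= e.
    k = math.ceil((1.0 - delta) * len(sorted_errors)) - 1
    return sorted_errors[max(k, 0)]


if __name__ == "__main__":
    for theta in (0.5, 1.0, 2.0):
        errs = final_errors(theta)
        row = [empirical_quantile(errs, d) for d in (0.2, 0.1, 0.05, 0.01)]
        print(theta, ["%.4f" % e for e in row])
```

Plotting the logarithm of each row against $\log\log(1/\delta)$ would give the kind of view described for Figure 2, where a $\log(1/\delta)^a$ dependence appears as a line of slope $a$. As the Figure 1 caption notes, estimates for small $\delta$ rest on few runs and are correspondingly less reliable.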

Theorems & Definitions (27)

  • Lemma 1
  • Lemma 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 6
  • Remark 7
  • Theorem 8
  • Theorem 9
  • Remark 10
  • ...and 17 more