Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization

Aleksandar Armacki; Dragana Bajović; Dušan Jakovetić; Soummya Kar; Ali H. Sayed

TL;DR

This work analyzes the long-term tail behavior of SGD-type methods in non-convex optimization using large deviations theory. It establishes sharp large deviation principle (LDP) upper bounds for vanilla SGD under bounded noise, yielding tail decay at rate $e^{-t/\log t}$, and extends them to clipped SGD under heavy-tailed noise with bounded moment of order $p$, with rates $e^{-t^{\beta_p}/\log t}$ for $p\in(1,2)$, where $\beta_p = \frac{4(p-1)}{3p-2}$, and $e^{-t/\log^2 t}$ for $p=2$. The rate functions are independent of the specific initial optimality gap. Matching finite-time lower bounds confirm that these rates are tight up to poly-logarithmic factors, and the rates are significantly faster than those obtained from prior finite-time analyses. The results give stronger guarantees for individual training runs in large-scale non-convex learning, including settings where clipping is used to handle heavy tails. Overall, the paper bridges finite-sample and asymptotic tail analysis, with implications for the reliability of SGD-based optimization over millions of iterations.

Abstract

The tail behaviour of SGD-induced processes has attracted considerable interest, because tail bounds offer strong guarantees for individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, few directly study the probability of failure, i.e., quantify the tail decay rate for a fixed error threshold. Moreover, existing results are finite-time in nature, which limits their ability to capture the true long-term tail decay; the latter is more informative for modern learning models, which are typically trained for millions of iterations. Our work closes these gaps by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the squared gradient norm of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$ for $p \in (1,2)$, where $\beta_p = \frac{4(p-1)}{3p-2}$, and at rate $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
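
The quantities in the abstract can be made concrete with a small simulation. The following is a minimal sketch, not the authors' algorithm or experimental setup: it runs clipped SGD on a one-dimensional non-convex toy cost and estimates the probability that the best squared gradient norm stays above a fixed error threshold. The toy cost, the constant step size, the constant clipping threshold, the symmetric Pareto noise, and all function names are illustrative assumptions; only the formula for $\beta_p$ is taken from the abstract.

```python
import math
import random


def beta_p(p):
    """Exponent beta_p = 4(p-1)/(3p-2), as stated in the abstract for p in (1, 2)."""
    if not 1.0 < p < 2.0:
        raise ValueError("beta_p is stated for p in (1, 2)")
    return 4.0 * (p - 1.0) / (3.0 * p - 2.0)


def clip(g, tau):
    """Clip a scalar gradient estimate to the interval [-tau, tau]."""
    return max(-tau, min(tau, g))


def grad_f(x):
    """Gradient of the toy non-convex cost f(x) = x^2 / (1 + x^2) (an assumption)."""
    return 2.0 * x / (1.0 + x * x) ** 2


def heavy_tailed_noise(rng, alpha=1.5):
    """Symmetric Pareto noise: zero mean, finite moments only of order below alpha."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * rng.paretovariate(alpha)


def clipped_sgd_best_grad_sq(x0, steps, step_size, tau, rng):
    """Run clipped SGD and return the smallest squared gradient over the visited iterates."""
    x, best = float(x0), math.inf
    for _ in range(steps):
        g = grad_f(x)
        best = min(best, g * g)
        # Constant step size and clipping threshold are simplifying assumptions.
        x -= step_size * clip(g + heavy_tailed_noise(rng), tau)
    return best


def empirical_tail(eps, runs, steps, seed=0):
    """Fraction of independent runs whose best squared gradient exceeds eps."""
    rng = random.Random(seed)
    failures = sum(
        clipped_sgd_best_grad_sq(1.0, steps, 0.05, 1.0, rng) > eps
        for _ in range(runs)
    )
    return failures / runs


if __name__ == "__main__":
    print("beta_p for p = 1.5:", beta_p(1.5))
    for t in (10, 50, 200):
        print("steps:", t, "empirical tail:", empirical_tail(1e-2, 200, t))
```

Fixing the threshold `eps` and increasing the number of steps mirrors the regime studied in the paper, where the error threshold is fixed and the failure probability is tracked as $t$ grows. A Monte Carlo estimate of this kind only resolves probabilities down to roughly one over the number of runs, so it illustrates the quantity being bounded rather than verifying the stated rates.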

Paper Structure (39 sections, 13 theorems, 81 equations, 1 table, 1 algorithm)

Introduction
Contributions
Literature Review
High-probability guarantees.
Large deviations guarantees.
Technical challenges and novelty.
Paper organization.
Notation.
Preliminaries
The Oracle Model and SGD-based Methods
1. Batch (i.e., offline) learning:
2. Streaming (i.e., online) learning:
Large Deviations Principle: a Background
Main Results
Assumptions
...and 24 more sections

Key Result

Lemma 3.1

Let the bounded-noise assumption hold. Then the following are true for any $t \geq 1$.

Theorems & Definitions (18)

Definition 1
Lemma 3.1
Theorem 1
Corollary 1
Lemma 3.2
Theorem 2
Corollary 2
Theorem 3
Proposition 1
Lemma 3.1
...and 8 more
