Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
TL;DR
This work analyzes the long-term tail behavior of SGD-type methods in non-convex optimization using large deviations theory. It establishes sharp large deviation principle (LDP) upper bounds for vanilla SGD under bounded noise, yielding tail decay at rate $e^{-t/\log t}$, and extends to clipped SGD under heavy-tailed noise with rates $e^{-t^{\beta_p}/\log t}$ for $p\in(1,2)$, where $\beta_p = \frac{4(p-1)}{3p-2}$, and $e^{-t/\log^2 t}$ for $p=2$, with rate functions independent of the specific initial optimality gap. Matching finite-time lower bounds confirm these rates are tight up to poly-log factors, demonstrating significantly faster long-term tail decay than prior finite-time analyses. Together, the results provide stronger, rigorous guarantees for individual training runs in large-scale non-convex learning, including scenarios where clipping is used to handle heavy tails. Overall, the paper bridges finite-sample and asymptotic tail analysis, offering practical insights into the reliability of SGD-based optimization over millions of iterations.
Abstract
The tail behaviour of SGD-induced processes has attracted considerable interest because it offers strong guarantees for individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, few directly study the probability of failure, i.e., quantify the tail decay rate for a fixed error threshold. Moreover, existing results are finite-time in nature, which limits their ability to capture the true long-term tail decay, the more informative quantity for modern learning models that are typically trained for millions of iterations. Our work closes these gaps by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$ for $p \in (1,2)$, where $\beta_p = \frac{4(p-1)}{3p-2}$, and at rate $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which shows rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
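To make the comparison of rates concrete, the following minimal sketch evaluates the exponents stated above for c-SGD. It is an illustration only: the function names are hypothetical, and all problem-dependent constants (which the actual bounds contain) are set to one as an assumption, so only the growth in $t$ is meaningful.

```python
import math


def beta_p(p: float) -> float:
    """Exponent beta_p = 4(p-1)/(3p-2) from the abstract, for p in (1, 2]."""
    if not 1.0 < p <= 2.0:
        raise ValueError("p must lie in (1, 2]")
    return 4.0 * (p - 1.0) / (3.0 * p - 2.0)


def log_tail_long_term(t: float, p: float) -> float:
    """Log of the long-term c-SGD tail rate, constants dropped (assumption).

    -t^{beta_p} / log(t) for p in (1, 2), and -t / log^2(t) for p = 2.
    """
    if p == 2.0:
        return -t / math.log(t) ** 2
    return -(t ** beta_p(p)) / math.log(t)


def log_tail_finite_time(t: float, p: float) -> float:
    """Log of the prior finite-time rate e^{-t^{beta_p/2}}, constants dropped."""
    return -(t ** (beta_p(p) / 2.0))


if __name__ == "__main__":
    t = 1e6  # a horizon of a million iterations, chosen for illustration
    for p in (1.25, 1.5, 2.0):
        print(p, beta_p(p), log_tail_long_term(t, p), log_tail_finite_time(t, p))
```

A more negative value corresponds to a faster-decaying tail. Under the unit-constant assumption, the long-term exponent is far more negative than the finite-time one at this horizon for the values of $p$ shown.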
