Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications
Zijian Liu
TL;DR
The paper studies Online Convex Optimization with heavy-tailed stochastic gradients ($\mathsf{p}\in(1,2]$) over a bounded domain and proves that the vanilla algorithms OGD, DA, and AdaGrad achieve the optimal in-expectation regret $\mathbb{E}[R_T(x)] \lesssim GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}$ without gradient clipping. It extends these results to online strongly convex, nonsmooth convex, and nonsmooth nonconvex settings, yielding the first optimal convergence rates for nonsmooth convex optimization without clipping and, via an Online-To-Nonconvex (O2NC) reduction, new sample complexities with matching lower bounds for nonsmooth nonconvex optimization. The framework generalizes to smooth losses and optimistic algorithms, providing adaptive, parameter-free guarantees (notably for AdaGrad), and extends to Hölder smooth and Hölder nonconvex scenarios. These findings narrow the gap between theory and practice: they explain why classical OCO methods work under heavy tails and give convergence guarantees with minimal algorithmic changes for stochastic optimization tasks whose gradient estimates are heavy-tailed.
Abstract
In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by this gap, this work examines several classical OCO algorithms (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regret bounds for these methods without any algorithmic modification. Remarkably, these bounds are fully optimal in all parameters and can be achieved even without knowing $\mathsf{p}$, suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable and optimal convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping. Furthermore, we explore broader settings (e.g., smooth OCO) and extend our ideas to optimistic algorithms to handle different cases simultaneously.
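To make the setting concrete, the following is a minimal sketch of projected Online Gradient Descent run on heavy-tailed gradient estimates with no clipping step. It is not the paper's exact algorithm or tuning. Several pieces are assumptions made for illustration: the step size `D / max(G*sqrt(t), sigma*t**(1/p))` is only chosen to mirror the shape of the regret bound, the symmetric Pareto-type noise with tail index `p + 0.1` is one hypothetical way to obtain a finite $\mathsf{p}$-th moment with infinite variance when $\mathsf{p}<1.9$, `sigma` is a plain noise scale rather than a calibrated moment bound, and the quadratic loss and all function names are invented here.

```python
import math
import random


def project_ball(x, radius):
    """Euclidean projection onto the origin-centred ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def step_size(t, D, G, sigma, p):
    """Illustrative step size (an assumption, not the paper's stated tuning)."""
    return D / max(G * math.sqrt(t), sigma * t ** (1.0 / p))


def heavy_tailed_noise(rng, p):
    """Zero-mean symmetric Pareto-type sample with tail index p + 0.1.

    The p-th absolute moment is finite; the variance is infinite
    whenever p + 0.1 <= 2.
    """
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * (rng.paretovariate(p + 0.1) - 1.0)


def ogd(grad_oracle, dim, radius, T, step):
    """Projected OGD; returns the list of points played in rounds 1..T."""
    x = [0.0] * dim
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x)
        g = grad_oracle(x)  # noisy gradient estimate, used as-is (no clipping)
        eta = step(t)
        x = project_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return iterates


def run_demo(T=200, p=1.5, seed=0):
    """Run OGD on a toy loss 0.5*||x - c||^2 with heavy-tailed gradient noise."""
    rng = random.Random(seed)
    dim, radius = 2, 1.0
    D, G, sigma = 2.0 * radius, 2.0, 1.0  # diameter, gradient-norm bound, noise scale
    c = [0.5, -0.5]

    def oracle(x):
        return [xi - ci + sigma * heavy_tailed_noise(rng, p) for xi, ci in zip(x, c)]

    return ogd(oracle, dim, radius, T, lambda t: step_size(t, D, G, sigma, p))
```

Calling `run_demo()` returns the sequence of played points, all of which stay inside the unit ball because of the projection. The bounded domain is what keeps a single very large noisy gradient from moving the iterate arbitrarily far.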
