An Improved Analysis of the Clipped Stochastic subGradient Method under Heavy-Tailed Noise
Daniela Angela Parletta, Andrea Paudice, Saverio Salzo
TL;DR
This work analyzes a clipped stochastic subgradient method (C-SsGM) for nonsmooth convex optimization over possibly unbounded domains, under heavy-tailed noise that has only the first $p$ moments ($p\in(1,2]$). It establishes the first convergence rates in expectation for the last iterate and gives near-optimal rates for both the last iterate and the average iterate, including tightened high-probability bounds and epoch-based reductions that relate last-iterate errors to the known average-iterate results. The results extend to kernelized supervised learning, yielding kernel-based implementations with provable risk guarantees under heavy tails, and preliminary numerical experiments support the robustness and performance gains. The paper also discusses practical parameter schedules for both the finite-horizon and anytime settings, highlights the improvements over prior work, and outlines open questions on removing the remaining log factors and obtaining a unified analysis for the last iterate. Overall, clipping improves the robustness and convergence of stochastic subgradient methods under heavy-tailed noise, which is relevant to large-scale nonsmooth optimization and kernel methods.
Abstract
In this paper, we provide novel optimal (or near-optimal) convergence rates for a clipped version of the stochastic subgradient method. We consider nonsmooth convex problems over possibly unbounded domains, under heavy-tailed noise that possesses only the first $p$ moments for $p \in \left]1,2\right]$. For the last iterate, we establish convergence in expectation for the objective values with rates of order $(\log^{1/p} k)/k^{(p-1)/p}$ and $1/k^{(p-1)/p}$ in the anytime and finite-horizon settings, respectively. We also derive new convergence rates, in expectation and with high probability, for the objective values along the average iterates, improving existing results by a $\log^{(2p-1)/p} k$ factor. These results are applied to the problem of supervised learning with kernels, demonstrating the effectiveness of our theory. Finally, we give preliminary experiments.
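
To make the object of study concrete, the following is a minimal Python sketch of a clipped stochastic subgradient iteration; it is not the authors' implementation. The test objective $f(x)=\|x-x^\star\|_1$, the Student-t noise model, the function names, and the schedules (step size $\gamma/k^{1/p}$, clipping level $\lambda k^{1/p}$) are all illustrative assumptions, since the abstract does not state the schedules used in the paper.

```python
import numpy as np


def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)


def clipped_subgradient(x0, x_star, n_iters, p=1.5, gamma=1.0, lam=1.0, seed=0):
    """Clipped stochastic subgradient sketch on f(x) = ||x - x_star||_1.

    The schedules below are assumptions chosen for illustration only:
    step size gamma / k**(1/p) and clipping level lam * k**(1/p).
    Returns the last iterate and the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    x_avg = np.zeros_like(x)
    for k in range(1, n_iters + 1):
        # A subgradient of the l1 objective at the current point.
        g = np.sign(x - x_star)
        # Student-t noise with df = p + 0.1 has a finite p-th moment but,
        # for p + 0.1 <= 2, infinite variance (assumed noise model).
        noise = rng.standard_t(df=p + 0.1, size=x.shape)
        # Clip the noisy subgradient, then take a step.
        g_hat = clip(g + noise, lam * k ** (1.0 / p))
        x = x - (gamma / k ** (1.0 / p)) * g_hat
        # Running mean of the iterates x_1, ..., x_k.
        x_avg += (x - x_avg) / k
    return x, x_avg
```

Calling `clipped_subgradient(np.full(5, 10.0), np.zeros(5), 1000)` returns both the last iterate and the average iterate, the two quantities the rates above refer to. With these assumed schedules every update has norm at most `gamma * lam`, so a single extreme noise draw cannot move the iterate arbitrarily far.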
