Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization
Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, Soummya Kar
TL;DR
This work addresses high-probability convergence for non-convex stochastic optimization under heavy-tailed noise within a broad nonlinear SGD framework. It analyzes three methods: N-SGD, for noise with a symmetric PDF, and N-SGE and N-MSGE, which use noise symmetrization to handle non-symmetric noise. All three attain tight convergence rates with exponentially decaying tails even when the noise has unbounded moments, with N-MSGE additionally requiring a bounded $p$-th moment. A central novelty is the symmetrization approach and the denoised-nonlinearity analysis, which yield an optimal $\widetilde{\mathcal{O}}(t^{-1/2})$ rate for symmetric noise and preserve near-optimal oracle complexities for non-symmetric noise under mild assumptions. When the noise has a bounded $p$-th moment with $p\in(1,2]$, N-MSGE achieves $\mathcal{O}(\epsilon^{-(6p-4)/(p-1)})$ complexity, while N-SGE attains $\mathcal{O}(\epsilon^{-4})$ complexity under broader conditions. Overall, the results advance understanding of nonlinear SGD under heavy-tailed noise and offer practical guidance on estimator choice and nonlinearity design for robust training in non-convex settings.
Abstract
We study high-probability convergence of SGD-type methods in non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, we consider a general black-box nonlinear framework, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$ for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes that a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combining these estimators with the nonlinear framework, we obtain the N-SGE and N-MSGE methods, respectively. Both achieve the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition, with N-MSGE additionally requiring a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
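To make the nonlinear framework and the symmetrization idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. Everything in it is an illustrative assumption: the toy objective $f(x) = \tfrac{1}{2}\|x\|^2$, the step size $a/\sqrt{t}$, the clipping threshold, the noise distributions, and all function names. In particular, the abstract does not state the formula for SGE; `symmetrized_oracle` shows one plausible reading, in which a second noisy gradient at a reference point with a known exact gradient is subtracted, so that the remaining noise is a difference of two i.i.d. samples and hence symmetric.

```python
import numpy as np

# Nonlinearities named in the abstract; parameter values are illustrative.
def sign_nl(g):
    return np.sign(g)  # component-wise sign

def clip_nl(g, tau=1.0):
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)  # joint (norm) clipping

def normalize_nl(g, eps=1e-12):
    return g / (np.linalg.norm(g) + eps)

def nonlinear_sgd(oracle, x0, nonlinearity, steps=2000, a=0.5, seed=0):
    """x_{t+1} = x_t - (a / sqrt(t)) * nonlinearity(stochastic gradient).

    The step-size schedule is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        x = x - (a / np.sqrt(t)) * nonlinearity(oracle(x, rng))
    return x

# Toy objective f(x) = 0.5 * ||x||^2, so the exact gradient at x is x.
def cauchy_oracle(x, rng):
    # Symmetric noise with no finite mean or variance.
    return x + rng.standard_cauchy(size=x.shape)

def skewed_oracle(x, rng):
    # Non-symmetric heavy-tailed noise: Lomax(1.5) has mean 2 and infinite
    # variance, so subtracting 2 gives zero-mean, skewed noise.
    return x + (rng.pareto(1.5, size=x.shape) - 2.0)

def symmetrized_oracle(x, rng):
    # Hypothetical symmetrization: reference point y = 0 with known exact
    # gradient. The resulting noise z1 - z2 is symmetric around zero.
    y = np.zeros_like(x)
    exact_grad_y = y  # gradient of the toy objective at y
    return skewed_oracle(x, rng) - skewed_oracle(y, rng) + exact_grad_y

if __name__ == "__main__":
    x0 = [5.0, -3.0]
    for name, nl in [("sign", sign_nl), ("clip", clip_nl), ("normalize", normalize_nl)]:
        x_final = nonlinear_sgd(cauchy_oracle, x0, nl)
        print(name, "final distance to minimizer:", np.linalg.norm(x_final))
    print("sign, skewed noise:", nonlinear_sgd(skewed_oracle, x0, sign_nl))
    print("sign, symmetrized: ", nonlinear_sgd(symmetrized_oracle, x0, sign_nl))
```

In this toy setting, sign updates driven directly by the skewed noise tend to settle away from the minimizer, because the noise median is nonzero, while the symmetrized oracle removes that bias. This is only meant to illustrate why symmetric noise is convenient for nonlinear updates; it says nothing about the rates or conditions established in the paper.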
