Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization

Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

TL;DR

This work addresses high-probability convergence for non-convex stochastic optimization under heavy-tailed noise within a broad nonlinear SGD framework. It analyzes three methods: N-SGD, which handles symmetric noise with unbounded moments, and N-SGE and N-MSGE, which use symmetrized gradient estimators to handle non-symmetric noise. All three attain the same $\widetilde{\mathcal{O}}(t^{-1/2})$ rate with exponentially decaying tails. A central novelty is the symmetrization approach and the denoised-nonlinearity analysis, which yield the optimal rate for symmetric noise and preserve near-optimal oracle complexities for non-symmetric noise under a mild technical condition on the noise PDF. When the noise has a bounded $p$-th moment with $p\in(1,2]$, N-MSGE achieves $\mathcal{O}(\epsilon^{-(6p-4)/(p-1)})$ complexity, while N-SGE attains the optimal $\mathcal{O}(\epsilon^{-4})$ complexity under broader conditions. Overall, the results advance the understanding of nonlinear SGD under heavy-tailed noise and offer practical guidance on estimator choice and nonlinearity design for robust training in non-convex settings.

Abstract

We study high-probability convergence of SGD-type methods in non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$, for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators, based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we get the N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition, with N-MSGE additionally requiring a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
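
The symmetrization idea described in the abstract can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions, not the paper's algorithm: the objective $F(x) = \tfrac{1}{2}\|x\|^2$, the Lomax noise model, the step-size schedule $a/\sqrt{t+1}$, and the function names (`skewed_noise`, `oracle`, `symmetrized_oracle`, `nonlinear_sgd`) are all hypothetical. The estimator form, two independent oracle calls differenced and corrected by a noiseless gradient at a reference point, is one natural reading of the SGE description and may differ from the paper's exact definition. It relies on the fact that the difference of two IID noise samples is symmetric about zero.

```python
import numpy as np

def skewed_noise(rng, shape):
    """Zero-mean, right-skewed heavy-tailed noise: Lomax(1.5) minus its mean 2.

    The variance is infinite; only moments of order p < 1.5 are finite.
    """
    return rng.pareto(1.5, size=shape) - 2.0

def oracle(x, rng):
    """Stochastic gradient of the toy objective F(x) = 0.5 * ||x||^2."""
    return x + skewed_noise(rng, x.shape)

def symmetrized_oracle(x, rng, y_ref, grad_ref):
    """Hypothetical symmetrized estimator (assumed form, not the paper's definition).

    Differencing two independent oracle calls turns the noise into z1 - z2,
    which is symmetric about zero; adding the noiseless gradient at y_ref
    restores the mean, so the estimate stays centered at grad F(x).
    """
    return oracle(x, rng) - oracle(y_ref, rng) + grad_ref

def nonlinear_sgd(estimator, x0, steps, a=0.5):
    """Run x <- x - a_t * sign(g_t) with the assumed schedule a_t = a / sqrt(t + 1)."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        x = x - a / np.sqrt(t + 1) * np.sign(estimator(x))
    return x

rng = np.random.default_rng(0)
x0 = [5.0, -3.0]
y_ref = np.zeros(2)      # reference point (assumption: chosen as the origin)
grad_ref = y_ref.copy()  # grad F(y_ref) = y_ref for this toy objective

x_raw = nonlinear_sgd(lambda x: oracle(x, rng), x0, steps=5000)
x_sym = nonlinear_sgd(lambda x: symmetrized_oracle(x, rng, y_ref, grad_ref), x0, steps=5000)

print("sign update, raw skewed noise :", x_raw)
print("sign update, symmetrized noise:", x_sym)
```

In this toy setting the sign update with raw skewed noise settles where the median of the stochastic gradient is zero, which is away from the minimizer at the origin, while the symmetrized variant settles near the origin. The price is two oracle calls per step plus one noiseless gradient at the reference point.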

Paper Structure

This paper contains 30 sections, 20 theorems, 88 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Lemma 1

Let Assumptions asmpt:noise-state and asmpt:nonlin hold, with the stochastic noise vectors ${\mathbf z^{t}} = \nabla f({\mathbf x^{t}};\xi^t) - \nabla F({\mathbf x^{t}})$ being mutually IID and independent from the state for all $t \in \mathbb{N}$, i.e., $P_{{\mathbf x}} \equiv P$, then the denoised where $\gamma_1,\gamma_2 > 0$ are constants that depend on the noise, choice of nonlinearity and ot

Figures (1)

  • Figure 1: Non-smooth component-wise nonlinearities and their smoothed counterparts. Top row: sign, clipping and their smooth counterparts. Bottom row: vectors sampled from a ball of radius 2 in $\mathbb{R}^2$ and their normalized and smooth normalized versions.
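
The nonlinearities named in the Figure 1 caption can be sketched as follows. The non-smooth maps (sign, component-wise clipping, normalization) are standard. The smooth counterparts shown here (`tanh`-based sign and clipping, and normalization with an additive constant in the denominator) are assumed illustrative forms, since the caption does not specify them, and all function names and default parameters are hypothetical. As in the caption, the normalization maps are applied to vectors sampled from a ball of radius 2 in $\mathbb{R}^2$.

```python
import numpy as np

def sign_nl(x):
    """Component-wise sign."""
    return np.sign(x)

def clip_nl(x, m=1.0):
    """Component-wise clipping to [-m, m]."""
    return np.clip(x, -m, m)

def smooth_sign_nl(x, k=0.1):
    """Assumed smooth counterpart of sign: tanh with a steep slope at zero."""
    return np.tanh(x / k)

def smooth_clip_nl(x, m=1.0):
    """Assumed smooth counterpart of clipping: near-linear at zero, saturating at +/- m."""
    return m * np.tanh(x / m)

def normalize_nl(v):
    """Joint normalization v / ||v|| applied row-wise; zero vectors map to zero."""
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    safe = np.where(norms > 0, norms, 1.0)
    return v / safe

def smooth_normalize_nl(v, eps=1.0):
    """Assumed smooth counterpart of normalization: v / (||v|| + eps)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

# Sample points uniformly from the ball of radius 2 in R^2.
rng = np.random.default_rng(0)
n = 1000
angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
radius = 2.0 * np.sqrt(rng.random(n))
pts = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

normalized = normalize_nl(pts)          # lands on the unit circle
smooth_normalized = smooth_normalize_nl(pts)  # lands strictly inside the unit disk

grid = np.linspace(-3.0, 3.0, 13)
print("sign       :", sign_nl(grid))
print("clip       :", clip_nl(grid))
print("smooth clip:", np.round(smooth_clip_nl(grid), 3))
```

Every map here has bounded output, which is the property that limits the influence of a single heavy-tailed sample on an update.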

Theorems & Definitions (42)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8
  • Lemma 1: Lemma 3.2 in armacki2023high
  • Example 1
  • ...and 32 more