Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Nikita Puchkin; Eduard Gorbunov; Nikolay Kutuzov; Alexander Gasnikov

TL;DR

We address stochastic convex optimization under structured heavy-tailed noise by stabilizing stochastic gradients with the smoothed median of means (SMoM). This yields high-probability convergence bounds for clipped-SGD and clipped-SSTM that are faster than the classical $\mathcal{O}(K^{-2(\alpha-1)/\alpha})$ rate. The analysis hinges on decomposing each component noise density into a symmetric part and a lighter-tailed antisymmetric part, which gives finite bias and variance bounds for SMoM estimators. These bounds translate into improved rates for both clipped-SGD and its accelerated variant, clipped-SSTM, in the convex and strongly convex regimes. The results demonstrate practical gains on heavy-tailed problems and point to extensions to non-convex, non-smooth, and distributed settings, with potential further improvements in logarithmic factors. Overall, the structured-noise model and SMoM gradient estimation provide a principled route to faster stochastic optimization under heavy-tailed perturbations.
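
To make the classical barrier concrete, the exponent of the rate $\mathcal{O}(K^{-2(\alpha-1)/\alpha})$ can be evaluated at a few values of $\alpha$. The values of $\alpha$ below are chosen only for illustration; the arithmetic shows how the rate degrades as the moment order approaches 1.

```latex
\[
\frac{2(\alpha-1)}{\alpha}\bigg|_{\alpha=2} = 1, \qquad
\frac{2(\alpha-1)}{\alpha}\bigg|_{\alpha=3/2} = \frac{2}{3}, \qquad
\frac{2(\alpha-1)}{\alpha}\bigg|_{\alpha=5/4} = \frac{2}{5}, \qquad
\lim_{\alpha \to 1^{+}} \frac{2(\alpha-1)}{\alpha} = 0 .
\]
```

So the classical guarantee is $\mathcal{O}(K^{-1})$ at $\alpha = 2$ and becomes vacuous as $\alpha \to 1$.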

Abstract

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(\alpha-1)/\alpha})$, when the stochastic gradients have finite moments of order $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.
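
The idea of feeding a median-based gradient estimate into clipped SGD can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the objective $0.5\|x\|^2$, the Cauchy noise, the group sizes, the step size, and the clipping level are all assumptions made for the example, and the `smoothing` parameter (Gaussian noise added to each group mean before taking the median) is a hypothetical stand-in, since the precise definition of the smoothed median of means is not given in this summary.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy sample (symmetric, no finite mean) via inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def noisy_grad(x, rng):
    """Gradient of the toy objective 0.5 * ||x||^2 plus Cauchy noise (assumed)."""
    return [xi + cauchy(rng) for xi in x]


def median_of_means(x, rng, num_groups=9, group_size=4, smoothing=0.0):
    """Coordinate-wise median of group means of stochastic gradients.

    `smoothing` is a hypothetical parameter: it adds Gaussian noise to each
    group mean before the median is taken. With 0.0 this is plain
    median of means.
    """
    d = len(x)
    group_means = []
    for _ in range(num_groups):
        grads = [noisy_grad(x, rng) for _ in range(group_size)]
        group_means.append([
            sum(g[j] for g in grads) / group_size + smoothing * rng.gauss(0.0, 1.0)
            for j in range(d)
        ])
    return [statistics.median(gm[j] for gm in group_means) for j in range(d)]


def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= lam:
        return list(g)
    return [v * lam / norm for v in g]


def clipped_sgd(x0, steps, lr, lam, rng):
    """SGD where each step uses a clipped median-of-means gradient estimate."""
    x = list(x0)
    for _ in range(steps):
        g = clip(median_of_means(x, rng), lam)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x_final = clipped_sgd([5.0, -3.0], steps=400, lr=0.05, lam=2.0, rng=rng)
    print("distance to optimum:", math.sqrt(sum(v * v for v in x_final)))
```

In this toy setup a single stochastic gradient has no finite mean, yet the iterates still settle near the minimizer at the origin, because the median over group means has much lighter tails than any individual sample.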

Paper Structure (37 sections, 27 theorems, 330 equations, 3 figures, 1 table, 1 algorithm)

INTRODUCTION
Contribution.
Paper structure.
SETUP AND NOTATION
Notation.
Setup.
RELATED WORK
High-probability complexity bounds.
Other results under heavy-tailed noise.
Median estimates.
SMOOTHED MEDIAN OF MEANS AND ITS PROPERTIES
MAIN RESULTS FOR STOCHASTIC OPTIMIZATION
Convergence of clipped-SGD
Convergence of clipped-SSTM
NUMERICAL EXPERIMENTS
...and 22 more sections

Key Result

Proposition 4.1

Fix any $j \in \{1, \dots, d\}$ and assume that the marginal density of $\nu_j$ is symmetric, that is, $\mathsf p_j(u) = \mathsf p_j(-u)$ for all $u \in \mathbb{R}$. Suppose that there exist positive numbers $B_j$ and $\beta_j$ such that [...]. Let $\nu_{j, 1}, \dots, \nu_{j, (2m + 1)}$ be independent copies of $\nu_j$. If $m > 3 / \beta_j$, then $\mathbb{E} \, \mathtt{Med}(\nu_{j, 1}, \dots, \nu_{j, (2m + 1)})$ ...
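
The stabilizing effect of taking the median of $2m + 1$ independent copies can be checked empirically. The sketch below is only an illustration: it assumes standard Cauchy noise (symmetric, with no finite mean) and the illustrative choice $m = 4$, neither of which comes from the proposition, whose tail condition is not reproduced in this summary. It estimates the mean and variance of the sample median over many repetitions.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy sample via the inverse-CDF transform."""
    return math.tan(math.pi * (rng.random() - 0.5))


def median_of_copies(m, rng):
    """Median of 2m + 1 independent Cauchy samples."""
    return statistics.median(cauchy(rng) for _ in range(2 * m + 1))


def summarize(m, trials, seed=0):
    """Empirical mean and variance of the median over repeated trials."""
    rng = random.Random(seed)
    meds = [median_of_copies(m, rng) for _ in range(trials)]
    return statistics.fmean(meds), statistics.variance(meds)


if __name__ == "__main__":
    mean_med, var_med = summarize(m=4, trials=20000)
    print("empirical mean of median:", mean_med)
    print("empirical variance of median:", var_med)
```

Although a single Cauchy sample has neither a mean nor a variance, the empirical mean of the median stays close to zero and its empirical variance stays moderate across repetitions, which is the qualitative behavior that makes median-based estimates usable under heavy tails.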

Figures (3)

Figure 1: Dependence of the mean error on the number of oracle calls, with 5th and 95th percentile bounds.
Figure 2: Dependence of the mean error on the number of iterations with a standard deviation upper bound.
Figure 3: Dependence of the confidence interval width for the error of mini-batched SGD with clipped smoothed median of means on the number of iterations.

Theorems & Definitions (42)

Proposition 4.1
Definition 4.2
Lemma 4.3
Remark 4.4
Lemma 4.5
Remark 4.6
Theorem 5.2
Corollary 5.3: Symmetric noise
Corollary 5.4: General noise
Theorem 5.5
...and 32 more
