Tight Lower Bounds and Optimal Algorithms for Stochastic Nonconvex Optimization with Heavy-Tailed Noise

Adrien Fradin, Abdurakhmon Sadiev, Laurent Condat, Peter Richtárik

Abstract

We study stochastic nonconvex optimization under heavy-tailed noise. In this setting, the stochastic gradients only have bounded $p$-th central moment ($p$-BCM) for some $p \in (1,2]$. Building on the foundational work of Arjevani et al. (2022) in stochastic optimization, we establish tight sample complexity lower bounds for all first-order methods under relaxed mean-squared smoothness ($q$-WAS) and $\delta$-similarity ($(q, \delta)$-S) assumptions, allowing any exponent $q \in [1,2]$ instead of the standard $q = 2$. These results substantially broaden the scope of existing lower bounds. To complement them, we show that Normalized Stochastic Gradient Descent with Momentum Variance Reduction (NSGD-MVR), a known algorithm, matches these bounds in expectation. Beyond expectation guarantees, we introduce a new algorithm, Double-Clipped NSGD-MVR, which allows the derivation of high-probability convergence rates under weaker assumptions than in previous works. Finally, for second-order methods with stochastic Hessians satisfying bounded $q$-th central moment assumptions for some exponent $q \in [1, 2]$ (allowing $q \neq p$), we establish sharper lower bounds than previous works while improving over Sadiev et al. (2025) (where only $p = q$ is considered) and yielding stronger convergence exponents. Together, these results provide a nearly complete complexity characterization of stochastic nonconvex optimization in heavy-tailed regimes.
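
To make the NSGD-MVR idea in the abstract concrete, the following is a minimal Python sketch under stated assumptions, not the paper's algorithm listing. It uses a commonly cited STORM-style recursive estimator, $g_t = \nabla f(x_t; \xi_t) + (1-\alpha)\,(g_{t-1} - \nabla f(x_{t-1}; \xi_t))$, followed by a normalized step $x_{t+1} = x_t - \gamma\, g_t / \|g_t\|$. The exact update, step-size schedule, and momentum parameter used in the paper are not shown in this excerpt and may differ. The toy objective $F(x) = \tfrac{1}{2}\|x\|^2$, the symmetric Pareto-type noise, and all function names (`heavy_tailed_noise`, `quadratic_oracle`, `nsgd_mvr`) are illustrative choices, not taken from the source.

```python
import math
import random


def heavy_tailed_noise(rng, dim, tail_index=1.5, scale=0.1):
    """Symmetric Pareto-type noise (illustrative assumption).

    Zero mean; the p-th moment is finite only for p < tail_index,
    so with tail_index < 2 the variance is infinite.
    """
    return [scale * rng.choice((-1.0, 1.0)) * (rng.paretovariate(tail_index) - 1.0)
            for _ in range(dim)]


def quadratic_oracle(x, noise):
    """Stochastic gradient of the toy objective F(x) = 0.5 * ||x||^2."""
    return [xi + ni for xi, ni in zip(x, noise)]


def nsgd_mvr(oracle, sample, x0, steps, lr, alpha):
    """Normalized SGD with a STORM-style momentum variance-reduced estimator.

    oracle(x, xi) returns a stochastic gradient at x for the sample xi;
    sample() draws a fresh xi. Returns the list of iterates.
    """
    x = list(x0)
    g = oracle(x, sample())  # initial gradient estimate
    iterates = [list(x)]
    for _ in range(steps):
        norm = math.sqrt(sum(gi * gi for gi in g))
        x_prev = x
        if norm > 0.0:  # skip the step if the estimate is exactly zero
            x = [xi - lr * gi / norm for xi, gi in zip(x, g)]
        xi_t = sample()  # one fresh sample, evaluated at both points
        g_new = oracle(x, xi_t)
        g_old = oracle(x_prev, xi_t)
        # Recursive correction: new gradient plus damped estimator error.
        g = [gn + (1.0 - alpha) * (gp - go)
             for gn, gp, go in zip(g_new, g, g_old)]
        iterates.append(list(x))
    return iterates


if __name__ == "__main__":
    rng = random.Random(0)
    traj = nsgd_mvr(quadratic_oracle, lambda: heavy_tailed_noise(rng, 2),
                    x0=[3.0, 4.0], steps=200, lr=0.05, alpha=0.1)
    print("final iterate:", traj[-1])
```

Because the step is normalized, every update moves the iterate by exactly `lr` regardless of how large a noise sample is. This is one intuition for why normalization is used under heavy tails. The sketch does not include the clipping used by the Double-Clipped variant.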

Paper Structure

This paper contains 96 sections, 63 theorems, 433 equations, 2 figures, 1 table, 5 algorithms.

Key Result

Theorem 3.1

Let $\Delta, \bar{L} > 0$, $\sigma_1 \ge 0$ and $0 < \varepsilon \le c_1 \sqrt{\bar{L} \Delta}$ for some universal constant $c_1 > 0$. Then, for any algorithm $A \in \mathcal{A}_{\texttt{zr}}$, there exists a function $f \in \mathcal{F}\left( \Delta \right)$, an oracle and a distribution $(O, \mat

Figures (2)

  • Figure 1: Top: Convergence trajectories of $F(x_t)$ over $T=1000$ iterations for $p=1.1$. Solid/dashed lines represent the mean over 20 independent runs, and shaded regions denote the standard deviation. Bottom: Algorithm performance across different tail indices $p \in \{1.1, 1.5, 2.0\}$. Hyperparameters are scaled strictly according to their respective theoretical optimal rates.
  • Figure 2: Left: Theoretical complexity $\mathcal{O}(\varepsilon^{-c})$ vs. tail index $p$. The y-axis represents the exponent $c$ of $\varepsilon$. Right: Empirical iterations required to reach a target suboptimality $F(x_t) < 1.2$ across different tail indices $p$. The empirical performance perfectly mirrors the theoretical scaling, demonstrating that D-clip-NSGD-MVR ($2$-WAS) provides the most robust acceleration in severe heavy-tailed regimes ($p<2$).

Theorems & Definitions (124)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Remark 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Remark 5.1
  • Theorem 5.1
  • ...and 114 more