Second-order Optimization under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity Limits

Abdurakhmon Sadiev, Peter Richtárik, Ilyas Fatkhullin

TL;DR

This work initiates a theoretical framework for second-order stochastic optimization (SOSO) under heavy-tailed noise, modeled by the $p$-BCM assumption (bounded $p$-th central moment), and derives minimax lower bounds that reveal the intrinsic sample-complexity limits of SOSO. It then introduces a near-optimal second-order method (NSGD_SOM) that attains these limits in the regime $L\le\sigma_h$ without requiring Hessian Lipschitzness, and a Hessian-clipping variant (Clip NSGD_SOM) whose high-probability convergence guarantees carry only poly-logarithmic overhead. Robustness comes from clipping not only gradients but also Hessian-vector products, which enables high-probability convergence when both the first- and second-order oracles are subject to heavy-tailed noise. Together, the results give the first comprehensive sample-complexity characterization for SOSO under heavy-tailed noise and establish Hessian clipping as a principled design choice for second-order algorithms in such regimes, with empirical validation on synthetic heavy-tailed tasks.

Abstract

Heavy-tailed noise is pervasive in modern machine learning applications, arising from data heterogeneity, outliers, and non-stationary stochastic environments. While second-order methods can significantly accelerate convergence in light-tailed or bounded-noise settings, such algorithms are often brittle and lack guarantees under heavy-tailed noise -- precisely the regimes where robustness is most critical. In this work, we take a first step toward a theoretical understanding of second-order optimization under heavy-tailed noise. We consider a setting where stochastic gradients and Hessians have only bounded $p$-th moments, for some $p\in (1,2]$, and establish tight lower bounds on the sample complexity of any second-order method. We then develop a variant of normalized stochastic gradient descent that leverages second-order information and provably matches these lower bounds. To address the instability caused by large deviations, we introduce a novel algorithm based on gradient and Hessian clipping, and prove high-probability upper bounds that nearly match the fundamental limits. Our results provide the first comprehensive sample complexity characterization for second-order optimization under heavy-tailed noise. This positions Hessian clipping as a robust and theoretically sound strategy for second-order algorithm design in heavy-tailed regimes.
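To make the clipping idea concrete, the following is a minimal sketch of one update that clips both a stochastic gradient and a stochastic Hessian-vector product before taking a normalized step. It is not the paper's algorithm: the momentum-style recursion, the function name `clipped_second_order_step`, the oracle callables `grad_sample` and `hvp_sample`, and the parameters `alpha`, `gamma`, `lam_g`, `lam_h` are all assumptions introduced for illustration. Only the norm-clipping operator `clip` is the standard construction.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def clip(v, lam):
    """Standard norm clipping: rescale v so that its norm is at most lam."""
    n = norm(v)
    if n <= lam:
        return list(v)
    return [c * (lam / n) for c in v]


def clipped_second_order_step(x, x_prev, g_prev, grad_sample, hvp_sample,
                              alpha, gamma, lam_g, lam_h):
    """One hypothetical update (an illustration, not the paper's method).

    grad_sample(x)   -> stochastic gradient at x (assumed oracle)
    hvp_sample(x, v) -> stochastic Hessian-vector product at x (assumed oracle)
    """
    delta = [a - b for a, b in zip(x, x_prev)]
    g_hat = clip(grad_sample(x), lam_g)        # clipped first-order sample
    h_hat = clip(hvp_sample(x, delta), lam_h)  # clipped second-order sample
    # Momentum estimate: the previous estimate is corrected by the clipped
    # Hessian-vector product, then mixed with the fresh clipped gradient.
    g = [(1 - alpha) * (gp + h) + alpha * gh
         for gp, h, gh in zip(g_prev, h_hat, g_hat)]
    n = norm(g)
    if n == 0.0:
        return list(x), g
    # Normalized step: the iterate moves a distance of exactly gamma.
    x_next = [xi - gamma * gi / n for xi, gi in zip(x, g)]
    return x_next, g
```

Because the step is normalized, a single huge sample cannot move the iterate far. Clipping instead protects the momentum estimate `g`, which would otherwise stay corrupted for many iterations after one large deviation.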

Paper Structure

This paper contains 29 sections, 19 theorems, 158 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Let $q\in \mathbb{N}_{\geq 1}$, and let $\Delta >0$, $L_{1:q}\overset{\text{def}}{=} (L_1,\dots, L_q)$, $\sigma_{1:q} \overset{\text{def}}{=} (\sigma_1,\dots, \sigma_q)$ and $\varepsilon \leq {\cal O}(\sigma_1)$. Then, there exists $F \in {\cal F}(\Delta, L_{1:q})$ and a corresponding noisy oracle $\t Moreover, this lower bound is realized by a construction of dimension $\Theta\left(\frac{\Delta}{\

Figures (5)

  • Figure 1: Sample complexity comparison for first-order (FOSO) and second-order (SOSO) stochastic optimization as a function of the tail index $p$. Each line corresponds to the leading term in the sample complexity for each class of algorithms. These leading terms match in the upper and lower bounds, so this characterization is exact. We establish the characterization along the entire green line, complementing prior work (arjevani2023lower) for $p=2$.
  • Figure 2: Performance of algorithms on a simple problem, $F(x) = 0.5 \left\lVert x\right\rVert^2$, $d = 10$ with synthetic noise generated from a two-sided Pareto distribution with tail index $p=1.1$. We observe that algorithms without clipping, NSGDM and NSGDHess, suffer significantly from noise. This motivates our more in-depth study involving gradient and Hessian clipping for high probability convergence.
  • Figure 3: Effect of the Hessian clipping level $\lambda_h = \lambda$ on the iteration complexity. The plot shows the number of iterations required for Clip NSGDHess to find a point with $\left\lVert\nabla F(x)\right\rVert \leq 3/2$. For extremely small and extremely large values of $\lambda$, more iterations are needed. The recommended value for this task is $\lambda_h = 10$.
  • Figure 4: Number of iterations needed by Clip NSGDMHess and Clip NSGDM, under a varying tail index, to find a point with $\left\lVert\nabla F(x)\right\rVert \leq 3/2$ when started from the same initial point. The performance of both algorithms degrades gradually as the tail index decreases. The iteration complexity of the second-order algorithm, Clip NSGDMHess, is uniformly better for all values of $p \in [1.1, 2]$.
  • Figure 5: Iteration complexity of Clip NSGDHess (Algorithm \ref{alg:NSGD_SOM_clipped}) as a function of the gradient clipping level, for three fixed values of the Hessian clipping level $\bar{\lambda}_h \in \{0.01, 1, 100\}$.
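
The synthetic setting described in Figure 2 can be approximated with a few lines of standard-library Python. This is a rough sketch under stated assumptions, not the authors' experimental code. It assumes that "two-sided Pareto with tail index $p$" means a random sign times a Pareto draw with shape parameter $p$ and unit scale. The helper names (`two_sided_pareto`, `noisy_grad`, `run`) are hypothetical, and the step size, momentum, and clipping level used in the usage note are arbitrary. The loop is a clipped, normalized momentum method that uses gradients only.

```python
import math
import random


def two_sided_pareto(rng, shape, d):
    """d i.i.d. symmetric samples: a random sign times a Pareto(shape) draw.

    Assumption: this is how the "two-sided Pareto" noise is parameterized.
    """
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(shape) for _ in range(d)]


def noisy_grad(rng, x, shape):
    """Stochastic gradient of F(x) = 0.5 * ||x||^2: x plus heavy-tailed noise."""
    noise = two_sided_pareto(rng, shape, len(x))
    return [xi + ni for xi, ni in zip(x, noise)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def clip(v, lam):
    """Standard norm clipping to radius lam."""
    n = norm(v)
    return list(v) if n <= lam else [c * lam / n for c in v]


def run(rng, x0, steps, gamma, alpha, lam, shape):
    """Clipped, normalized momentum iteration (illustrative hyperparameters)."""
    x, m = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        g = clip(noisy_grad(rng, x, shape), lam)
        m = [(1 - alpha) * mi + alpha * gi for mi, gi in zip(m, g)]
        n = norm(m)
        if n > 0.0:
            x = [xi - gamma * mi / n for xi, mi in zip(x, m)]
    return x
```

For example, `run(random.Random(0), [1.0] * 10, 1000, 0.01, 0.1, 10.0, 1.1)` simulates the $d = 10$, $p = 1.1$ setting. Passing `lam=float("inf")` disables clipping, which gives a simple way to compare against an unclipped baseline on the same seed.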

Theorems & Definitions (31)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Lemma 1: Lemma 10 from Hubler2024clip_to_norm
  • Lemma 2: Lemma 10 from cutkosky2021high
  • Lemma 3: Bernstein inequality
  • Lemma 4: Lemma 5.1 from sadiev2023high
  • Definition 2
  • ...and 21 more