High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Eduard Gorbunov, Marina Danilova, Innokentiy Shibaev, Pavel Dvurechensky, Alexander Gasnikov

TL;DR

This work establishes high-probability convergence guarantees for non-smooth convex stochastic optimization under heavy-tailed noise by leveraging gradient clipping and Hölder-continuous gradients. It introduces two clipping-based methods, clipped-SSTM and clipped-SGD, with carefully designed step-size and clipping rules that yield logarithmic dependence on the confidence parameter $\beta$ and near-optimal iteration/oracle complexity across regimes. The authors extend these results to strongly convex problems via restart schemes, and provide comprehensive proofs and concentration-based analyses. Empirical results on synthetic data and neural networks (including BERT and ResNet) corroborate the theoretical advantages, particularly in heavy-tailed noise settings, while maintaining competitive performance in lighter-tail regimes. These findings advance robust guarantees for non-smooth stochastic optimization and offer practical, scalable algorithms for real-world learning tasks with non-sub-Gaussian noise.

Abstract

Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.
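To make the role of gradient clipping concrete, here is a minimal sketch of a clipped stochastic gradient loop in pure Python. It is an illustration under stated assumptions, not the authors' algorithm as specified: the function names `clip` and `clipped_sgd`, the constant step size `step`, and the constant clipping level `lam` are all hypothetical choices, whereas the paper derives its own stepsize and clipping rules. The sketch only shows the basic mechanism: a stochastic gradient whose Euclidean norm exceeds `lam` is rescaled to norm `lam` before the step, which limits the effect of a single heavy-tailed sample.

```python
import math


def clip(g, lam):
    """Return g rescaled so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= lam:
        return list(g)
    return [v * (lam / norm) for v in g]


def clipped_sgd(stoch_grad, x0, step, lam, iters):
    """Run SGD with clipped gradients; return the average of the iterates.

    stoch_grad: callable mapping a point (list of floats) to a stochastic
                gradient estimate (list of floats).
    step, lam:  constant step size and clipping level (assumed here for
                simplicity; not the rules derived in the paper).
    """
    x = list(x0)
    avg = [0.0] * len(x)
    for k in range(iters):
        # Running average of x_0, ..., x_k.
        avg = [a + (xi - a) / (k + 1) for a, xi in zip(avg, x)]
        g = clip(stoch_grad(x), lam)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return avg
```

For example, `clipped_sgd(lambda x: [2 * v for v in x], [10.0], 0.1, 1.0, 200)` runs the loop on the noiseless gradient of $f(x) = x^2$; replacing the lambda with a noisy oracle gives the stochastic setting discussed in the abstract.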

Paper Structure

This paper contains 48 sections, 17 theorems, 200 equations, 12 figures, 2 tables, 4 algorithms.

Key Result

Theorem 2.1

Assume that the function $f$ is convex and that its stochastic gradient and its gradient satisfy eq:bounded_variance_clipped_SSTM and eq:holder_def, respectively, with $\sigma > 0$, $\nu \in [0,1]$, $M_\nu > 0$ on $Q = B_{3R_0}(x^*) = \{x\in\mathbb R^n\mid \|x-x^*\|_2 \le 3R_0\}$, where $R_0 \ge \|x^0 - x^*\|_2$.
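The two referenced conditions are not reproduced in this excerpt. Judging by their labels and by the abstract, they presumably denote the standard bounded-variance and Hölder-continuity assumptions, which would read roughly as follows. The notation $\nabla f(x,\xi)$ for the stochastic gradient is an assumption made here for illustration, and the exact form in the paper may differ.

```latex
% Presumed form of eq:bounded_variance_clipped_SSTM (unbiased, bounded variance on Q):
\mathbb{E}_\xi\left[\nabla f(x,\xi)\right] = \nabla f(x),
\qquad
\mathbb{E}_\xi\left[\left\|\nabla f(x,\xi) - \nabla f(x)\right\|_2^2\right] \le \sigma^2
\quad \text{for all } x \in Q.

% Presumed form of eq:holder_def (Hölder-continuous gradient on Q):
\left\|\nabla f(x) - \nabla f(y)\right\|_2 \le M_\nu \left\|x - y\right\|_2^{\nu}
\quad \text{for all } x, y \in Q.
```

Under this reading, $\nu = 1$ recovers the usual smooth case with Lipschitz gradient, while $\nu = 0$ corresponds to bounded variation of the (sub)gradients, which covers the non-smooth setting named in the title.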

Figures (12)

  • Figure 1: Noise distribution of the stochastic gradients for the synthetic dataset, depending on batch size and $p$ of the loss function (eq:generalized_lin_reg). Red lines: Gaussian probability density functions with means and variances empirically estimated by the samples. The total number of batches for each graph is $5\cdot10^5$.
  • Figure 2: Results obtained for different $p$, measured by the best relative train loss achieved. To calculate the relative loss, we use $f_p(x_{\text{pred}})/f_p(x_{\text{true}})$, where $f_p(x_{\text{true}})$ is non-zero because of the noise added to the training part of the dataset.
  • Figure 3: Noise distribution of the stochastic gradients for ResNet-18 on ImageNet-100 and BERT fine-tuning on the CoLA dataset before the training. Red lines: Gaussian probability density functions with means and variances empirically estimated by the samples. Batch count is the total number of samples used to build a histogram.
  • Figure 4: Train and validation loss + accuracy for different optimizers on both problems. Here, "batch count" denotes the total number of used stochastic gradients.
  • Figure 5: Results obtained for different $p$, measured by the earliest epoch at which the model reached twice the loss at $x_{\text{true}}$.
  • ...and 7 more figures

Theorems & Definitions (28)

  • Definition 1.1
  • Theorem 2.1: Simplified version of Theorem thm:main_result_clipped_SSTM
  • Theorem 2.2: Simplified version of Theorem thm:main_result_clipped_SSTM_str_cvx
  • Theorem 3.1: Simplified version of Theorem thm:main_result_clipped_SGD
  • Theorem 3.2: Simplified version of Theorem thm:main_result_clipped_SGD_str_cvx
  • Lemma 4.1
  • proof
  • Lemma 4.2: Lemma F.5 from gorbunov2020clipped_sstm.
  • Theorem 4.1
  • proof
  • ...and 18 more