Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

TL;DR

The paper tackles the problem of high-probability convergence of online SGD under symmetric heavy-tailed noise, introducing a unified nonlinear SGD framework with a black-box nonlinearity $\boldsymbol{\Psi}$ and a denoised map $\boldsymbol{\Phi}$ to enable robust optimization without moment assumptions. It proves that non-convex objectives achieve gradient-norm-squared convergence at $\widetilde{\mathcal{O}}(t^{-1/4})$, while strongly convex objectives admit last-iterate convergence at $\mathcal{O}(t^{-\zeta})$ with $\zeta\in(0,1)$ and weighted-average convergence at $\widetilde{\mathcal{O}}(t^{-1/4})$, along with neighborhood convergence under noise mixtures. The results apply to a broad class of nonlinearities beyond clipping and extend to mixtures of symmetric and non-symmetric noise, with rate exponents that are constant and independent of the noise moment parameter $p$. Empirical results corroborate the theory and show that component-wise nonlinearities can outperform joint clipping, underscoring the practical value of the general framework for online learning under heavy-tailed noise.

Abstract

We study high-probability convergence in online learning in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities such as sign, quantization, component-wise clipping and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of the gradient norm-squared at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optimum at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art works depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
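The nonlinear SGD update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clipping threshold `m`, the constants `a` and `delta`, the quadratic cost and the Cauchy noise in the demo are all assumptions chosen for illustration. The step-size form $\alpha_t = a/(t+1)^\delta$ follows the policy quoted in the Figure 5 caption.

```python
import numpy as np


def sign_nl(x):
    """Component-wise sign nonlinearity."""
    return np.sign(x)


def componentwise_clip(x, m=1.0):
    """Clip every coordinate to the interval [-m, m] (m is an assumed threshold)."""
    return np.clip(x, -m, m)


def joint_clip(x, m=1.0):
    """Rescale the whole vector so that its Euclidean norm is at most m."""
    norm = np.linalg.norm(x)
    return x if norm <= m else x * (m / norm)


def nonlinear_sgd(grad_oracle, x0, nonlinearity, steps, a=1.0, delta=0.75):
    """Run x <- x - alpha_t * Psi(g_t) with alpha_t = a / (t + 1) ** delta.

    grad_oracle(x) returns a (possibly noisy) gradient at x; the nonlinearity
    Psi is treated as a black box, mirroring the framework in the abstract.
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(steps):
        alpha = a / (t + 1) ** delta
        x = x - alpha * nonlinearity(grad_oracle(x))
    return x


def demo(seed=0, steps=5000):
    """Hypothetical demo: quadratic cost with symmetric heavy-tailed (Cauchy) noise."""
    rng = np.random.default_rng(seed)
    x_star = np.array([1.0, -2.0])

    def noisy_grad(x):
        # Gradient of 0.5 * ||x - x_star||^2 plus standard Cauchy noise,
        # which is symmetric and has no finite mean.
        return (x - x_star) + rng.standard_cauchy(size=x.shape)

    for name, psi in [("sign", sign_nl),
                      ("component-wise clip", componentwise_clip),
                      ("joint clip", joint_clip)]:
        x = nonlinear_sgd(noisy_grad, np.zeros(2), psi, steps)
        print(f"{name}: distance to optimum = {np.linalg.norm(x - x_star):.4f}")


if __name__ == "__main__":
    demo()
```

Because the nonlinearity is passed in as a plain function, the same loop covers sign, component-wise clipping and joint clipping, which is the sense in which the framework treats it as a black box. The demo prints the final distance to the optimum for each choice; the numbers depend on the seed and are not a reproduction of the paper's experiments.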

Paper Structure

This paper contains 33 sections, 13 theorems, 81 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Lemma 3.1

Let Assumptions asmpt:nonlin and asmpt:noise hold. Then, the effective noise vectors $\{{\mathbf e^{(t)}}\}_{t \in \mathbb{N}}$ satisfy:

Figures (6)

  • Figure 1: Random projections of per-sample gradients across epochs.
  • Figure 2: Performance of sign, component-wise clipping and joint clipping. Left to right: MSE performance and high-probability performance for $\varepsilon \in \{0.1, 0.01\}$, respectively. Both component-wise nonlinearities converge faster in the MSE sense and achieve exponential tail decay. Note that joint clipping does not achieve exponentially decaying tails in the second and third panels, as it does not reach the required accuracy in the allocated number of iterations.
  • Figure 3: The distribution of gradient projections after training for 15 epochs, using 6 different projection matrices.
  • Figure 4: Comparison of test accuracies and losses of SGD with different nonlinearities under Lévy stable gradient noise.
  • Figure 5: MSE performance of nonlinear SGD methods, using step-size policy $\alpha_t = 1/(t+1)^\delta$, for different values of $\delta \in (2/3,1)$. Left to right: we choose the values $\delta \in \{17/24,3/4,7/8\}$, respectively. We can see that both component-wise nonlinearities converge faster in the MSE sense, independent of the step-size choice.
  • ...and 1 more figure

Theorems & Definitions (51)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Example 1
  • Example 2
  • Example 3
  • ...and 41 more