Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise

Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet, Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu

TL;DR

The paper addresses how heavy-tailed gradient noise interacts with momentum in stochastic optimization, focusing on SGD with momentum (SGDm). It models SGDm as an $\alpha$-stable Lévy-driven SDE and proves a $\mathcal{W}_1$ algorithmic stability bound, from which a generalization bound for Lipschitz surrogates follows; it also provides explicit results for quadratic losses showing momentum can worsen generalization. A novel uniform-in-time discretization bound connects the continuous SDE behavior to discrete-time SGDm, and the authors substantiate their theory with synthetic quadratic experiments and neural-network tests on MNIST and CIFAR-10. The findings indicate momentum may degrade generalization under heavy-tailed noise, guiding practical choices of momentum and step-size and suggesting directions for future work on the trade-offs between training speed and generalization in heavy-tailed regimes.

Abstract

Understanding the generalization properties of optimization algorithms under heavy-tailed noise has attracted growing attention. However, the existing theoretical results mainly focus on stochastic gradient descent (SGD), and the analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable observation: for quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
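To make the setting concrete, the following is a minimal sketch of momentum dynamics driven by heavy-tailed noise on a quadratic loss. It is not the authors' code. It assumes a kinetic SDE of the form $dV_t = -(\gamma V_t + \nabla f(X_t))\,dt + \sigma\,dL_t^{\alpha}$, $dX_t = V_t\,dt$, discretized with an Euler-type step. The function names and the parameters `gamma`, `sigma`, and `eta` are illustrative choices, and the symmetric $\alpha$-stable noise is drawn with the Chambers-Mallows-Stuck method.

```python
import numpy as np


def sample_symmetric_alpha_stable(alpha, size, rng):
    """Draw symmetric alpha-stable samples (skewness 0, unit scale)
    with the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    if np.isclose(alpha, 1.0):
        return np.tan(u)                          # Cauchy special case
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def simulate_sgdm(A, x0, alpha=1.8, eta=0.01, gamma=2.0, sigma=0.1,
                  n_steps=1000, seed=0):
    """Euler-type scheme for an assumed kinetic SDE on f(x) = 0.5 * x^T A x:
        dV = -(gamma * V + grad f(X)) dt + sigma dL^alpha,   dX = V dt.
    Noise enters only the velocity, so it is degenerate in (X, V)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        grad = A @ x
        noise = sample_symmetric_alpha_stable(alpha, x.shape, rng)
        # alpha-stable increments over a step eta scale like eta**(1/alpha)
        v = v - eta * (gamma * v + grad) + sigma * eta ** (1.0 / alpha) * noise
        x = x + eta * v
    return x


if __name__ == "__main__":
    A = np.diag([1.0, 4.0])
    for alpha in (2.0, 1.5):
        x_final = simulate_sgdm(A, x0=[1.0, 1.0], alpha=alpha)
        print(f"alpha={alpha}: final iterate {x_final}")
```

Setting `alpha=2.0` recovers Gaussian noise, while smaller values of `alpha` give heavier tails. Running the same recursion on two datasets that differ in one sample and comparing the resulting iterates would give a rough empirical proxy for the stability notion discussed here.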

Paper Structure

This paper contains 29 sections, 24 theorems, 220 equations, 2 figures.

Key Result

Theorem 2

Suppose that $\mathcal{A}$ is an $\varepsilon$-uniformly stable algorithm. Then the expected generalization error is bounded by $\varepsilon$ in absolute value.

Figures (2)

  • Figure 1: Experiments comparing SGD with and without momentum on synthetic data with quadratic loss $f$.
  • Figure 2: Comparing SGD with and without momentum, using the following model-dataset combinations: (top left) MNIST - FCN, (top right) MNIST - CNN, (bottom left) CIFAR-10 - FCN, and (bottom right) CIFAR-10 - CNN.

Theorems & Definitions (28)

  • Definition 1: Hardt et al. (2016), Definition 2.1
  • Theorem 2: Hardt et al. (2016), Theorem 2.2
  • Theorem 3
  • Corollary 4
  • Remark 5
  • Theorem 6
  • Corollary 7
  • Proposition 8
  • Remark 9
  • Remark 10
  • ...and 18 more