Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise
Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet, Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu
TL;DR
The paper addresses how heavy-tailed gradient noise interacts with momentum in stochastic optimization, focusing on SGD with momentum (SGDm). It models SGDm as an $\alpha$-stable Lévy-driven SDE and proves a $\mathcal{W}_1$ algorithmic stability bound, from which a generalization bound for Lipschitz surrogates follows; it also provides explicit results for quadratic losses showing momentum can worsen generalization. A novel uniform-in-time discretization bound connects the continuous SDE behavior to discrete-time SGDm, and the authors substantiate their theory with synthetic quadratic experiments and neural-network tests on MNIST and CIFAR-10. The findings indicate momentum may degrade generalization under heavy-tailed noise, guiding practical choices of momentum and step-size and suggesting directions for future work on the trade-offs between training speed and generalization in heavy-tailed regimes.
Abstract
Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, existing theoretical results mainly focus on stochastic gradient descent (SGD), and an analysis of optimizers beyond SGD under heavy-tailed noise is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a notable observation: for quadratic loss functions, SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
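As a concrete picture of the setting, the following is a minimal sketch of heavy-ball SGDm on a one-dimensional quadratic loss with symmetric $\alpha$-stable gradient noise. It is an illustration only and not the paper's experimental setup: the function names `sample_sas` and `sgdm_quadratic`, the loss $f(x) = a x^2 / 2$, the additive noise model, and all default parameter values are assumptions made for this example. The noise is drawn with the Chambers-Mallows-Stuck method, for which $\alpha = 2$ recovers Gaussian noise with variance 2.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable samples (unit scale) via Chambers-Mallows-Stuck.

    alpha in (0, 2]; alpha = 2 gives N(0, 2), alpha = 1 gives standard Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def sgdm_quadratic(alpha, eta=0.01, beta=0.9, a=1.0, sigma=0.1,
                   n_steps=5000, seed=0):
    """Heavy-ball SGDm on f(x) = a * x**2 / 2 with additive alpha-stable noise.

    Hypothetical toy model: the stochastic gradient is a * x + sigma * noise.
    Returns the trajectory of iterates x_1, ..., x_{n_steps}.
    """
    rng = np.random.default_rng(seed)
    noise = sample_sas(alpha, n_steps, rng)
    x, v = 1.0, 0.0                      # initial position and momentum buffer
    xs = np.empty(n_steps)
    for k in range(n_steps):
        grad = a * x + sigma * noise[k]  # noisy gradient of the quadratic
        v = beta * v - eta * grad        # momentum update
        x = x + v                        # parameter update
        xs[k] = x
    return xs


if __name__ == "__main__":
    for alpha in (2.0, 1.5):
        xs = sgdm_quadratic(alpha)
        # Summarize the spread of the late iterates with a high quantile.
        print(alpha, np.quantile(np.abs(xs[1000:]), 0.99))
```

Varying `alpha` and `beta` in this toy model gives a quick way to explore how the spread of the iterates responds to tail heaviness and momentum; it does not by itself reproduce the stability or generalization bounds described above.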
