Emergence of heavy tails in homogenized stochastic gradient descent
Zhe Jiao, Martin Keller-Ressel
TL;DR
This work asks why neural network parameters trained with SGD tend to follow heavy-tailed distributions, and how the tail behavior depends on the optimization settings. The authors approximate SGD by a diffusion, homogenized stochastic gradient descent (hSGD), and relate its dynamics to Pearson diffusions. From this they derive explicit upper and lower bounds on the asymptotic tail-index η, which quantitatively link tail behavior to the learning rate, batch size, regularization, and data geometry. Experiments show that the distribution of SGD iterates is well approximated by skew-t distributions, that the empirical tail-index lies between the theoretical bounds, and that it is sensitive to the learning rate γ, the batch size B, and the dimension d. The findings challenge claims that Brownian-driven SDEs cannot capture the heavy tails of SGD, and they offer a principled framework for relating tail behavior to generalization and optimization performance in deep learning.
Abstract
It has repeatedly been observed that loss minimization by stochastic gradient descent (SGD) leads to heavy-tailed distributions of neural network parameters. Here, we analyze a continuous diffusion approximation of SGD, called homogenized stochastic gradient descent, show that it is asymptotically heavy-tailed, and give explicit upper and lower bounds on its tail-index. We validate these bounds in numerical experiments and show that they are typically close approximations to the empirical tail-index of SGD iterates. In addition, their explicit form enables us to quantify the interplay between optimization parameters and the tail-index. In doing so, we contribute to the ongoing discussion on links between heavy tails and the generalization performance of neural networks, as well as the ability of SGD to avoid suboptimal local minima.
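To make the notion of an empirical tail-index concrete, the following is a minimal sketch of one standard way to estimate it from samples, the Hill estimator. This is not the procedure used in the paper, whose experiments fit skew-t distributions. The function name `hill_tail_index`, the cutoff `k`, and the use of Student-t samples in place of SGD iterates are all assumptions made for illustration. The sketch uses the convention P(|X| > x) ~ C x^(-η), which may differ from the paper's definition of η.

```python
import numpy as np


def hill_tail_index(samples, k):
    """Hill estimate of the tail-index from the k largest absolute values.

    Hypothetical helper for illustration. Assumes the convention
    P(|X| > x) ~ C * x**(-eta) for large x.
    """
    abs_sorted = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]
    if not 0 < k < abs_sorted.size:
        raise ValueError("k must satisfy 0 < k < number of samples")
    threshold = abs_sorted[k]  # the (k+1)-th largest absolute value
    if threshold <= 0.0:
        raise ValueError("threshold must be positive; choose a smaller k")
    # Mean log-excess over the threshold estimates 1 / eta.
    mean_log_excess = np.mean(np.log(abs_sorted[:k] / threshold))
    return 1.0 / mean_log_excess


# Stand-in data: Student-t samples, whose tail-index under the convention
# above equals the degrees of freedom nu. These are NOT SGD iterates.
rng = np.random.default_rng(0)
nu = 3.0
samples = rng.standard_t(df=nu, size=200_000)

eta_hat = hill_tail_index(samples, k=1_000)
print(f"true tail-index: {nu}, Hill estimate: {eta_hat:.2f}")
```

The estimate depends on the cutoff `k`: too small a value gives a noisy estimate, while too large a value includes samples from outside the power-law regime and biases the result. In practice one would inspect the estimate over a range of `k` before comparing it with theoretical bounds.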
