Emergence of heavy tails in homogenized stochastic gradient descent

Zhe Jiao, Martin Keller-Ressel

TL;DR

The work tackles why neural network parameters trained with SGD tend to exhibit heavy-tailed distributions and how this tail behavior depends on optimization settings. By treating SGD as a diffusion via homogenized stochastic gradient descent (hSGD) and mapping its dynamics to Pearson diffusions, the authors derive explicit upper and lower bounds on the asymptotic tail-index η, providing quantitative links between learning rate, batch size, regularization, and data geometry. They validate the theory with experiments that show the SGD tails are well approximated by skew-t distributions, with empirical tails staying between the theoretical bounds and showing sensitivity to γ, B, and d. The findings challenge claims that Brownian-driven SDEs cannot capture SGD tails and offer a principled framework to relate tail behavior to generalization and optimization performance. This contributes a rigorous, quantitative lens on heavy tails in SGD and their implications for training dynamics and generalization in deep learning.

Abstract

It has repeatedly been observed that loss minimization by stochastic gradient descent (SGD) leads to heavy-tailed distributions of neural network parameters. Here, we analyze a continuous diffusion approximation of SGD, called homogenized stochastic gradient descent, show that it behaves asymptotically heavy-tailed, and give explicit upper and lower bounds on its tail-index. We validate these bounds in numerical experiments and show that they are typically close approximations to the empirical tail-index of SGD iterates. In addition, their explicit form enables us to quantify the interplay between optimization parameters and the tail-index. Doing so, we contribute to the ongoing discussion on links between heavy tails and the generalization performance of neural networks as well as the ability of SGD to avoid suboptimal local minima.
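The "empirical tail-index" mentioned above can be made concrete with a generic estimator. The following is a minimal sketch, not the authors' procedure: the summary says the paper fits skew-t distributions, whereas this example substitutes the classical Hill estimator. The function name `hill_tail_index`, the cutoff `k`, and the Student-t stand-in data are all assumptions introduced for illustration; in practice the samples would be SGD iterates.

```python
import numpy as np


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest absolute values.

    Hypothetical helper for illustration; k is a tuning choice.
    """
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < number of samples")
    threshold = x[-k - 1]  # (k+1)-th largest absolute value
    top = x[-k:]           # the k largest absolute values
    # Mean log-excess over the threshold estimates 1 / tail-index.
    return 1.0 / np.mean(np.log(top / threshold))


rng = np.random.default_rng(0)
# Stand-in data (assumption): Student-t with 3 degrees of freedom,
# whose true tail index is 3. Replace with recorded SGD iterates.
heavy = rng.standard_t(df=3.0, size=200_000)
eta_hat = hill_tail_index(heavy, k=2_000)
print(f"Hill estimate of the tail index: {eta_hat:.2f}")
```

The estimate depends on the choice of `k`: small values give high variance, large values pull in non-tail observations and bias the result, so it is worth checking stability over a range of cutoffs.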

Paper Structure

This paper contains 21 sections, 6 theorems, 84 equations, 2 figures, 3 tables.

Key Result

Theorem 3.1

For $i = 1, \dots, d$, let $(Z^{i}_t)_{t\geqslant 0}$ be the components of the rescaled hSGD from eq:sde_Z and $(\hat{Z}^{i}_t)_{t\geqslant 0}$ be the independent Pearson diffusions from PearsonIV. Then for any $t\geqslant 0$ and any convex function $g: \mathbb{R}\rightarrow\mathbb{R}$ it holds that […]. In particular, this implies the ordering of $p$-th moments for all $p \ge 1$.
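To see why a comparison with a Pearson diffusion carries information about tails, a textbook special case may help. The sketch below is an illustration under stated assumptions: the process $X_t$ and the parameters $\theta > 0$, $a > 0$ are chosen here for simplicity and are not the paper's PearsonIV parametrization.

```latex
% Illustrative symmetric Pearson diffusion (assumed form, not the paper's):
dX_t = -\theta X_t \, dt + \sqrt{2\theta a \,(X_t^2 + 1)} \, dW_t .

% Stationary density via p(x) \propto \sigma^{-2}(x) \exp\big(\int 2 b(x) / \sigma^2(x) \, dx\big),
% with drift b(x) = -\theta x and \sigma^2(x) = 2\theta a (x^2 + 1):
p(x) \;\propto\; \frac{1}{x^2 + 1}
      \exp\!\left( -\int \frac{2\theta x}{2\theta a \,(x^2 + 1)} \, dx \right)
   \;=\; (1 + x^2)^{-\left(1 + \frac{1}{2a}\right)} .

% Tail behavior and moments:
p(x) \sim |x|^{-(2 + 1/a)}
\quad\Longrightarrow\quad
\mathbb{P}(|X| > x) \sim C\, x^{-(1 + 1/a)},
\qquad
\mathbb{E}|X|^p < \infty \iff p < 1 + \tfrac{1}{a} .
```

In this toy case the stationary law is a rescaled Student $t$-distribution with $1 + 1/a$ degrees of freedom, so its tail index is $1 + 1/a$. Because the tail index marks the threshold at which $p$-th moments stop being finite, an ordering of $p$-th moments against such a process plausibly translates into a one-sided bound on the tail index of the compared process.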

Figures (2)

  • Figure 1: (a)-(c) Quantile-quantile plots of fitted $t$-distribution against empirical SGD iterates; (d)-(f) quantile-quantile plots of fitted $\alpha$-stable distribution against empirical SGD iterates; (g) complementary cumulative distribution function (ccdf) of the $t$-distribution with different tail indices; (h)-(j) comparison between the ccdf of empirical data and the $t$-distribution parameterized by the upper tail-index bound $\eta^*$ and the lower bound $\eta_*$.
  • Figure 2: Empirical complementary cumulative distribution functions on log-log scale for the effect of varying parameters.

Theorems & Definitions (13)

  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Theorem 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Proof of Theorem \ref{thm:main1}
  • Theorem 3.4
  • Theorem 3.5
  • ...and 3 more