Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent

Max Hennick, Stijn De Baerdemacker

TL;DR

The paper relates SGD dynamics to Bayesian inference by modeling weight diffusion as motion on a fractal loss landscape governed by the local learning coefficient (LLC) $\lambda(w)$. It introduces a time-fractional Fokker-Planck formulation to capture subdiffusive SGD behavior and connects fractal dimensions to diffusion via homogenization, yielding stationary distributions that link SGD trajectories to Bayesian posteriors. The key contributions are formalizing the near-stability hypothesis, deriving relationships between the LLC, the spectral dimension $d_s$, and diffusion coefficients, and validating these predictions with MNIST experiments in which the LLC and $d_s$ correlate with weight dynamics and generalization. The framework offers a principled account of how stochastic optimization and Bayesian sampling relate in high-dimensional, fractal loss landscapes, with implications for understanding hyperparameter effects and the role of adaptive optimizers in training large models.
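To give a feel for what "subdiffusive" means here, the following is a minimal sketch of a continuous-time random walk (CTRW) with heavy-tailed waiting times, a standard toy model whose continuum limit is commonly described by a time-fractional Fokker-Planck equation. It is not the paper's derivation: the function names `ctrw_positions` and `ensemble_msd`, the Pareto waiting-time law, and the unit jump size are all assumptions chosen for illustration.

```python
import random


def ctrw_positions(query_times, alpha, rng):
    """Positions of one 1-D walker at the given (ascending) query times.

    Waiting times between jumps are Pareto-distributed with shape `alpha`
    and minimum value 1, so for 0 < alpha < 1 the mean waiting time is
    infinite. Each jump moves the walker by +1 or -1 with equal probability.
    """
    positions = []
    x = 0
    next_jump = rng.paretovariate(alpha)  # time of the first jump (>= 1)
    for q in query_times:
        # Apply every jump that happens no later than the query time.
        while next_jump <= q:
            x += rng.choice((-1, 1))
            next_jump += rng.paretovariate(alpha)
        positions.append(x)
    return positions


def ensemble_msd(query_times, alpha, n_walkers=2000, seed=0):
    """Mean squared displacement at each query time, averaged over walkers."""
    rng = random.Random(seed)
    totals = [0.0] * len(query_times)
    for _ in range(n_walkers):
        for i, x in enumerate(ctrw_positions(query_times, alpha, rng)):
            totals[i] += x * x
    return [total / n_walkers for total in totals]


if __name__ == "__main__":
    times = [10, 100, 1000, 10000]  # hypothetical observation times
    for t, msd in zip(times, ensemble_msd(times, alpha=0.5)):
        print(f"t={t:>6}  MSD={msd:.2f}")
```

For waiting-time exponents between 0 and 1, the mean squared displacement of such a walk is expected to grow more slowly than linearly in time, which is the qualitative signature the TL;DR refers to. Whether SGD weight trajectories follow this particular microscopic picture is not something this sketch establishes.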

Abstract

We show that the behavior of stochastic gradient descent is related to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, where the fractal dimension can be accounted for in a purely Bayesian way. By doing this we show that SGD can be regarded as a modified Bayesian sampler which accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors which determine the learning process, and seemingly answer the question of how SGD and purely Bayesian sampling are related.
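One common way to examine the diffusion of weights is to fit a power law to the squared displacement from the initial weights, $\|w_t - w_0\|^2 \approx C\,t^{\alpha}$, and read off the exponent $\alpha$ from a log-log regression. The sketch below illustrates that idea only; it is not the authors' measurement code. The helper names (`squared_displacement`, `fit_diffusion_exponent`, `synthetic_snapshots`) are hypothetical, and the synthetic Gaussian random walk stands in for real training snapshots.

```python
import math
import random


def squared_displacement(snapshots):
    """Squared Euclidean distance of every weight snapshot from the first."""
    w0 = snapshots[0]
    return [sum((a - b) ** 2 for a, b in zip(w, w0)) for w in snapshots]


def fit_diffusion_exponent(times, sq_disp):
    """Least-squares fit of log(sq_disp) = log(C) + alpha * log(t).

    Returns (alpha, C). Points with t <= 0 or zero displacement are
    skipped because their logarithm is undefined.
    """
    pts = [(math.log(t), math.log(d))
           for t, d in zip(times, sq_disp) if t > 0 and d > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points with t > 0 and displacement > 0")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    alpha = sxy / sxx
    prefactor = math.exp(mean_y - alpha * mean_x)
    return alpha, prefactor


def synthetic_snapshots(n_steps, dim, seed=0):
    """Assumption: a plain Gaussian random walk used as stand-in data."""
    rng = random.Random(seed)
    w = [0.0] * dim
    snaps = [list(w)]
    for _ in range(n_steps):
        w = [wi + rng.gauss(0.0, 0.01) for wi in w]
        snaps.append(w)
    return snaps


if __name__ == "__main__":
    snaps = synthetic_snapshots(n_steps=500, dim=100)
    times = list(range(len(snaps)))
    alpha, _ = fit_diffusion_exponent(times, squared_displacement(snaps))
    print(f"estimated exponent: {alpha:.2f}")
```

With real training runs, `snapshots` would be flattened weight vectors saved at regular intervals. An exponent near 1 would indicate ordinary diffusion and a value below 1 subdiffusion, though a single trajectory gives a noisy estimate and the fitted value depends on the time window chosen.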

Paper Structure

This paper contains 14 sections, 8 theorems, 46 equations, 14 figures.

Key Result

Lemma 3.1

Consider a subset $\mathcal{W} \subset W$ on which the effective diffusion coefficient $D_\xi$ is (approximately) constant. Suppose that there exist steady-state solutions $w^*$ of the SGD-FFPE on this subset, so that $\mathcal{D}^\alpha_t p(w^*,t) = 0$. The steady-state distributi...

Figures (14)

  • Figure 1: Weight displacement over time for a subset of model sizes over the MNIST dataset.
  • Figure 2: The final LLC vs. the spectral dimension. The size of the dots represents the number of parameters of the tested model. Note that none of these values fall below the line denoting the inequality of lemma \ref{lem:spect_ineq}.
  • Figure 3: The average vs. the spectral dimension. These results align with corollary \ref{cor:av_spect}.
  • Figure 4: The histogram of the diffusion exponent. Note that the diffusion exponent seems to concentrate among higher values, which agrees with the result of lemma \ref{res:stat_state}.
  • Figure 5: The corresponding changes in weight vs. the LLC.
  • ...and 9 more figures

Theorems & Definitions (13)

  • Definition 3.1: Effective Diffusion Coefficient
  • Lemma 3.1
  • Corollary 3.1
  • Lemma 3.2
  • Corollary 3.2
  • Lemma
  • proof
  • Corollary
  • proof
  • Lemma
  • ...and 3 more