Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

Itamar Harel, Yonathan Wolanowsky, Gal Vardi, Nathan Srebro, Daniel Soudry

TL;DR

The paper develops time-uniform, trajectory-agnostic generalization bounds for models trained via data-dependent Markov processes, with a focus on continuous Langevin dynamics (CLD). By proving a generalized second law for divergences to a Gibbs stationary distribution, it bounds the marginal distribution over parameters at any training time and translates these bounds into PAC-Bayes generalization gaps that depend only on the initialization and the Gibbs potential. In the CLD case, the stationary distribution is explicit, yielding simple, dimension-free bounds that scale with the inverse temperature $\beta$ and the initialization loss, and that carry over to variants such as state-dependent diffusion and restricted initializations. Overall, the work provides a simple, broadly applicable framework for understanding generalization under noisy, Markov-process-based training, with potential implications for SGLD and neural networks trained with noise.
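To make the training process concrete, here is a minimal sketch of a discretized Langevin update (an Euler-Maruyama step). It is an illustration only: the paper analyzes the continuous-time process, whereas this code uses a finite step size. The sketch assumes the common convention $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2/\beta}\,dW_t$, whose stationary density is proportional to $e^{-\beta L}$, and a toy quadratic loss. The names `langevin_step`, `quadratic_loss_grad`, and `run_langevin` are hypothetical and do not come from the paper.

```python
import math
import random


def langevin_step(theta, grad, beta, step, rng):
    """One Euler-Maruyama step of d(theta) = -grad L dt + sqrt(2/beta) dW."""
    # Assumed convention: the noise added over a step of size `step`
    # has variance 2 * step / beta per coordinate.
    noise_scale = math.sqrt(2.0 * step / beta)
    return [
        t - step * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]


def quadratic_loss_grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * ||theta||^2 (an assumption)."""
    return list(theta)


def run_langevin(theta0, beta, step, n_steps, seed=0):
    """Run the discretized dynamics from theta0 and return the final iterate."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    theta = list(theta0)
    for _ in range(n_steps):
        grad = quadratic_loss_grad(theta)
        theta = langevin_step(theta, grad, beta, step, rng)
    return theta


if __name__ == "__main__":
    final = run_langevin([1.0, -1.0], beta=50.0, step=0.01, n_steps=1000)
    print(final)
```

A larger `beta` (lower temperature) shrinks the injected noise, so the iterates stay closer to plain gradient descent; a smaller `beta` spreads them more widely around the minimizer.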

Abstract

We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(\beta\,\mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L(\theta_0) = O(1)$ with standard initialization scaling. In contrast to previous guarantees, our bound depends on neither the training time nor mixing, and it has no dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the marginal distribution divergence from initialization remains bounded, as implied by a generalized second law of thermodynamics.
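As a rough numerical illustration of the bound quoted in the abstract, the sketch below evaluates $\sqrt{(\beta\,\mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ for given inputs. It assumes the expression is used exactly as written in the abstract, with no additional constants; the precise statement in the paper may include constant factors. The function name `langevin_gap_bound` and the example values are hypothetical.

```python
import math


def langevin_gap_bound(beta, init_loss, n_samples, delta):
    """Evaluate sqrt((beta * E[L(theta_0)] + log(1/delta)) / N).

    beta      : inverse temperature (positive)
    init_loss : expected training loss at initialization, E[L(theta_0)]
    n_samples : sample size N
    delta     : failure probability, so the bound holds w.p. 1 - delta
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if init_loss < 0:
        raise ValueError("init_loss must be non-negative")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt((beta * init_loss + math.log(1.0 / delta)) / n_samples)


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper's experiments.
    for n in (1_000, 10_000, 100_000):
        bound = langevin_gap_bound(beta=100.0, init_loss=1.0, n_samples=n, delta=0.05)
        print(n, round(bound, 4))
```

The expression involves only $\beta$, the initialization loss, $\delta$, and $N$: it is non-vacuous once $N$ is large compared with $\beta\,\mathbb{E} L(\theta_0)$, and neither the parameter dimension nor the training time appears anywhere in it.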

Paper Structure

This paper contains 50 sections, 19 theorems, 123 equations, 1 figure, 8 tables.

Key Result

Corollary 2.5

For any distribution $\nu$ and any time-invariant Markov process, and any stationary distribution $p_\infty$ that is Gibbs w.r.t. $\nu$ with potential $\Psi \geq 0$ (the Markov chain need not be ergodic, and need not converge to $p_\infty$), at any time $t\geq 0$:

Figures (1)

  • Figure 1: Parity Results. Left: Training error. Right: test error and generalization bound.

Theorems & Definitions (64)

  • Definition 2.1: Divergences
  • Definition 2.2: Gibbs distribution
  • Claim 2.3
  • Claim 2.4: Cover's Second Law of Thermodynamics
  • Corollary 2.5
  • Remark 2.6
  • Theorem 2.7
  • proof
  • Remark 2.8
  • Remark 2.9
  • ...and 54 more