Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes
Itamar Harel, Yonathan Wolanowsky, Gal Vardi, Nathan Srebro, Daniel Soudry
TL;DR
The paper develops time-uniform, trajectory-agnostic generalization bounds for models trained via data-dependent Markov processes, with a focus on continuous Langevin dynamics (CLD). By proving a generalized second law for divergences to a Gibbs stationary distribution, it bounds the divergence of the marginal distribution over parameters at any training time and translates this bound into PAC-Bayes generalization gaps that depend only on the initialization and the Gibbs potential. For CLD, the stationary distribution is explicit, yielding simple, dimension-free bounds that scale with the inverse temperature $\beta$ and the initialization loss; the results also cover variants such as state-dependent diffusion and restricted initializations. Overall, the work provides a simple, broadly applicable framework for understanding generalization under noisy, Markov-process-based training, with potential implications for SGLD and neural networks trained with noise.
Abstract
We analyze the generalization gap (the gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $θ_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $β^{-1}$, i.e., gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $β^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(β\mathbb{E} L (θ_0) + \log(1/δ))/N}$ with probability $1-δ$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L (θ_0) =O(1)$ with standard initialization scaling. In contrast to previous guarantees, our bound has no dependence on training time, no reliance on mixing, and no dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
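To get a feel for how the stated rate behaves, the following minimal sketch evaluates the expression $\sqrt{(β\mathbb{E} L (θ_0) + \log(1/δ))/N}$ for a few inverse temperatures. The function name `langevin_gap_rate` and the example values are hypothetical, chosen only for illustration. The abstract does not specify constant factors, so the output should be read as a scaling rate rather than an exact guarantee.

```python
import math


def langevin_gap_rate(beta, init_loss, n_samples, delta):
    """Evaluate sqrt((beta * E[L(theta_0)] + log(1/delta)) / N).

    Assumption: constants in the actual bound are not given in the
    abstract, so this returns only the rate quoted there.
    beta      -- inverse temperature (must be positive)
    init_loss -- expected training loss at initialization, E[L(theta_0)]
    n_samples -- number of training samples N
    delta     -- failure probability over the draw of the dataset
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if init_loss < 0:
        raise ValueError("init_loss must be non-negative")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt((beta * init_loss + math.log(1.0 / delta)) / n_samples)


if __name__ == "__main__":
    # Illustrative values only: E[L(theta_0)] = 1, N = 50,000, delta = 0.05.
    for beta in (1.0, 100.0, 10_000.0):
        rate = langevin_gap_rate(beta, init_loss=1.0, n_samples=50_000, delta=0.05)
        print(f"beta = {beta:>8.0f}  rate = {rate:.4f}")
```

The expression makes the trade-off visible: lowering the temperature (raising $β$) loosens the rate, and it stays small only while $β\mathbb{E} L (θ_0)$ is small relative to $N$. Neither the parameter count nor the training time appears as an argument.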
