Generalization of the Gibbs algorithm with high probability at low temperatures
Andreas Maurer
TL;DR
This paper addresses generalization bounds for individual hypotheses drawn from the Gibbs posterior, including the challenging low-temperature regime, where generalization depends on the data-dependent loss landscape. Its main theorem is a disintegrated PAC-Bayesian bound that links the generalization gap to the total prior volume of near-minimizers via the complexity term $\Lambda_{\beta}(h,\mathbf{x})$ and the empirical loss CDF $\hat{\varphi}$. The paper interprets the behavior of this bound across temperature regimes, under margin conditions, and in the zero-temperature limit. It further extends the framework to stochastic algorithms beyond the Gibbs algorithm, with similar density-based bounds, and provides corollaries for sub-Gaussian losses and margin-based binary classification. The results give theoretical support to the view that flat minima and favorable data-dependent loss landscapes can improve generalization, and they illuminate how high- and low-temperature analyses relate to stochastic optimization methods such as SGLD and SGD in deep learning.
Abstract
The paper gives a bound on the generalization error of the Gibbs algorithm, which recovers known data-independent bounds for the high-temperature range and extends to the low-temperature range, where generalization depends critically on the data-dependent loss landscape. It is shown that, with high probability, the generalization error of a single hypothesis drawn from the Gibbs posterior decreases with the total prior volume of all hypotheses with similar or smaller empirical error. This gives theoretical support to the belief in the benefit of flat minima. The zero-temperature limit is discussed, and the bound is extended to a class of similar stochastic algorithms.
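To make the objects in the abstract concrete, the following is a minimal sketch, not the paper's construction. It assumes a finite hypothesis set with a uniform prior and the usual form of the Gibbs posterior, in which the weight of a hypothesis is proportional to its prior mass times $\exp(-\beta \hat{L})$, with $\hat{L}$ its empirical loss and $\beta$ the inverse temperature. The function names and the loss values are hypothetical and chosen only for illustration. The second function computes the prior mass of all hypotheses with empirical loss at most a given threshold, which is the kind of "prior volume" the abstract refers to; it is not the complexity term of the paper's bound.

```python
import math
import random


def gibbs_posterior(empirical_losses, prior, beta):
    """Posterior weights proportional to prior[i] * exp(-beta * loss[i]).

    Assumes a finite hypothesis set; beta is the inverse temperature.
    """
    # Subtracting the minimum loss leaves the normalized weights unchanged
    # and avoids underflow at large beta (low temperature).
    min_loss = min(empirical_losses)
    unnormalized = [
        p * math.exp(-beta * (loss - min_loss))
        for p, loss in zip(prior, empirical_losses)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def prior_volume_below(empirical_losses, prior, threshold):
    """Total prior mass of hypotheses with empirical loss <= threshold."""
    return sum(p for p, loss in zip(prior, empirical_losses) if loss <= threshold)


def sample_hypothesis(posterior, rng):
    """Draw the index of a single hypothesis from the posterior."""
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]


# Hypothetical empirical losses: three near-minimizers and two poor hypotheses.
losses = [0.10, 0.12, 0.11, 0.50, 0.90]
prior = [0.2] * 5  # uniform prior (an assumption of this sketch)

hot = gibbs_posterior(losses, prior, beta=0.0)     # infinite temperature: posterior equals prior
cold = gibbs_posterior(losses, prior, beta=200.0)  # low temperature: mass concentrates on near-minimizers
volume = prior_volume_below(losses, prior, threshold=0.12)

drawn = sample_hypothesis(cold, random.Random(0))
```

In this toy setting, a larger `volume` means that many hypotheses fit the data about as well as the one drawn, which is the informal picture of a flat minimum that the abstract connects to better generalization.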
