A PAC-Bayesian Link Between Generalisation and Flat Minima

Maxime Haddouche; Paul Viallard; Umut Simsekli; Benjamin Guedj

TL;DR

The paper develops time-uniform PAC-Bayes bounds that combine gradient information with Poincaré and log-Sobolev inequalities to connect flat minima with generalisation in overparameterised models. It shows that when the posterior satisfies a Poincaré inequality and a QSB-type condition, the generalisation error can be bounded by a flatness term involving gradient norms and a data-dependent KL term, with a transitory fast rate under suitable parameter choices. For Gibbs posteriors with log-Sobolev priors, the KL term is controlled by gradient information, linking optimisation dynamics to generalisation, and a Wasserstein-based bound extends these insights to deterministic predictors. An empirical study on MNIST and Fashion-MNIST supports the QSB condition with $C<1$, suggesting that flat minima found during optimisation contribute to improved generalisation. Overall, the work provides a gradient-informed PAC-Bayes framework that clarifies how flat minima influence generalisation in modern, overparameterised learning systems.
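To make the idea of a flatness term built from gradient norms more concrete, the sketch below estimates the expected squared gradient norm of a loss under a Gaussian posterior centred at a minimum. Everything in it is an illustrative assumption rather than the paper's construction: the loss is a toy quadratic $L(h) = \tfrac{1}{2}(h-h^*)^\top H (h-h^*)$, the posterior is isotropic, and the names `expected_sq_grad_norm`, `flat_hessian` and `sharp_hessian` are hypothetical.

```python
import numpy as np


def expected_sq_grad_norm(hessian, center, std, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_{h ~ N(center, std^2 I)} ||grad L(h)||^2
    for the toy quadratic loss L(h) = 0.5 * (h - center)^T H (h - center).

    The Hessian H is assumed symmetric, so grad L(h) = H (h - center).
    """
    rng = np.random.default_rng(seed)
    # Draw predictors from the isotropic Gaussian posterior around the minimum.
    h = center + std * rng.standard_normal((n_samples, center.size))
    # Row-wise gradients; (h - center) @ H equals H (h - center) for symmetric H.
    grads = (h - center) @ hessian
    return float(np.mean(np.sum(grads ** 2, axis=1)))


center = np.zeros(2)
flat_hessian = np.diag([0.1, 0.2])     # small curvature: flat minimum
sharp_hessian = np.diag([10.0, 20.0])  # large curvature: sharp minimum

flat_term = expected_sq_grad_norm(flat_hessian, center, std=0.1)
sharp_term = expected_sq_grad_norm(sharp_hessian, center, std=0.1)

# For this toy loss the exact value is std^2 * trace(H^2).
flat_exact = 0.1 ** 2 * np.trace(flat_hessian @ flat_hessian)
sharp_exact = 0.1 ** 2 * np.trace(sharp_hessian @ sharp_hessian)
```

In this toy setting the gradient term is much smaller at the flat minimum than at the sharp one for the same posterior width, which is the qualitative behaviour the summary attributes to flat minima.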

Abstract

Modern machine learning usually involves predictors in the overparameterised setting (number of trained parameters greater than dataset size), and their training yields not only good performance on training data, but also good generalisation capacity. This phenomenon challenges many theoretical results, and remains an open problem. To reach a better understanding, we provide novel generalisation bounds involving gradient terms. To do so, we combine the PAC-Bayes toolbox with Poincaré and Log-Sobolev inequalities, avoiding an explicit dependency on the dimension of the predictor space. Our results highlight the positive influence of flat minima (being minima with a neighbourhood nearly minimising the learning problem as well) on generalisation performance, involving directly the benefits of the optimisation phase.

Paper Structure (32 sections, 26 theorems, 81 equations, 1 figure)


Introduction
Preliminaries
Reaching a flat minimum allows Poincaré posteriors to generalise well
Time-uniform estimation PAC-Bayes bounds for heavy-tailed losses
On the role of flat minima in PAC-Bayes learning.
A focus on $C$.
Towards fully empirical bound for gradient-Lipschitz functions
Generalisation ability of Gibbs distributions with a log-Sobolev prior
Controlling the KL divergence when $\mathrm{Q}$ is a Gibbs posterior.
Generalisation ability of Gibbs posteriors.
Comparison to literature.
On the benefits of the gradient norm in Wasserstein PAC-Bayes learning
Can \ref{th:wpb-grad} go to zero with large $m$?
An empirical study of \ref{as:relaxed-bounded} for neural networks
Empirical findings.
...and 17 more sections

Key Result

Proposition 3

Let $\mathrm{Q}= \mathcal{N}(\mu, \Sigma)$ be a Gaussian distribution on $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$. Then, for any $f \in \mathrm{H}^{1}(\mathrm{Q})$, the log-Sobolev and Poincaré inequalities hold for $\mathrm{Q}$: the distribution $\mathrm{Q}$ is $\texttt{L-Sob}(c_{LS})$ with constant $c_{LS}(\mathrm{Q})=2\|\Sigma\|_{op}$ and is also $\texttt{Poinc}(c_{P})$ with constant $c_{P}(\mathrm{Q})=\|\Sigma\|_{op}$, where $\|\Sigma\|_{op}$ denotes the operator norm of $\Sigma$.
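The Poincaré part of this statement can be checked numerically. The sketch below assumes the standard form of the inequality, $\mathrm{Var}_{\mathrm{Q}}(f) \le \|\Sigma\|_{op}\, \mathbb{E}_{\mathrm{Q}}\|\nabla f\|^2$, and compares both sides by Monte Carlo for one test function. The helper names `gaussian_poincare_constant` and `poincare_sides`, the covariance matrix and the test function $f(x) = \sum_i \sin(x_i)$ are illustrative choices, not taken from the paper.

```python
import numpy as np


def gaussian_poincare_constant(sigma):
    """Operator (spectral) norm of the covariance matrix."""
    return float(np.linalg.norm(sigma, ord=2))


def poincare_sides(f, grad_f, mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimates of Var_Q(f) and ||Sigma||_op * E_Q ||grad f||^2
    for Q = N(mu, sigma)."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu, sigma, size=n_samples)
    variance = float(np.var(f(x)))
    # Mean squared gradient norm over the samples.
    energy = float(np.mean(np.sum(grad_f(x) ** 2, axis=1)))
    return variance, gaussian_poincare_constant(sigma) * energy


mu = np.zeros(3)
sigma = np.diag([2.0, 0.5, 0.1])


def f(x):
    # Test function f(x) = sum_i sin(x_i), evaluated row-wise.
    return np.sin(x).sum(axis=1)


def grad_f(x):
    # Gradient of f: the i-th component is cos(x_i).
    return np.cos(x)


constant = gaussian_poincare_constant(sigma)
variance, bound = poincare_sides(f, grad_f, mu, sigma)
```

A single test function cannot prove the inequality; the check only illustrates that the variance stays below the gradient term scaled by $\|\Sigma\|_{op}$ in this instance.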

Figures (1)

Figure 1: Evolution of the test risks (with the $01$-loss and the bounded cross-entropy loss) and the value of $C$ during the training phase.

Theorems & Definitions (30)

Definition 1: Poincaré inequality
Definition 2: Log-Sobolev inequality
Proposition 3: Gross (1975), Brascamp and Lieb (1976), Beckner (1989)
Proposition 3
Example 1
Theorem 5
Corollary 5
Theorem 6
Corollary 7
Lemma 7
...and 20 more
