A PAC-Bayesian Link Between Generalisation and Flat Minima

Maxime Haddouche, Paul Viallard, Umut Simsekli, Benjamin Guedj

TL;DR

The paper develops time-uniform PAC-Bayes bounds that combine gradient information with Poincaré and log-Sobolev inequalities to connect flat minima with generalisation in overparameterised models. It shows that when the posterior satisfies a Poincaré inequality and a QSB-type condition, the generalisation error can be bounded by a flatness term involving gradient norms plus a data-dependent KL term, with a transitory fast rate under suitable parameter choices. For Gibbs posteriors with log-Sobolev priors, the KL term is itself controlled by gradient information, linking optimisation dynamics to generalisation, and a Wasserstein-based bound extends these insights to deterministic predictors. An empirical study on MNIST and Fashion-MNIST supports the QSB condition with $C<1$, suggesting that flat minima found during optimisation contribute to improved generalisation. Overall, the work provides a gradient-informed PAC-Bayes framework that clarifies how flat minima influence generalisation in modern, overparameterised learning systems.

Abstract

Modern machine learning usually involves predictors in the overparameterised setting (number of trained parameters greater than dataset size), and their training yields not only good performance on training data, but also good generalisation capacity. This phenomenon challenges many theoretical results, and remains an open problem. To reach a better understanding, we provide novel generalisation bounds involving gradient terms. To do so, we combine the PAC-Bayes toolbox with Poincaré and Log-Sobolev inequalities, avoiding an explicit dependency on the dimension of the predictor space. Our results highlight the positive influence of flat minima (being minima with a neighbourhood nearly minimising the learning problem as well) on generalisation performance, involving directly the benefits of the optimisation phase.
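As a rough illustration of what a gradient-based flatness term measures, the sketch below estimates the expected squared gradient norm of a loss under a Gaussian distribution centred at a minimum. Everything in it is an assumption made for illustration and is not taken from the paper: the toy quadratic loss, the isotropic Gaussian $\mathcal{N}(0, \sigma^2 I)$, and the helper name `expected_sq_grad_norm`.

```python
import random


def expected_sq_grad_norm(curvatures, sigma, n_samples=5000, seed=0):
    """Monte Carlo estimate of E_{h ~ N(0, sigma^2 I)} ||grad L(h)||^2
    for the toy loss L(h) = 0.5 * sum_i curvatures[i] * h[i]**2,
    whose minimum is at h = 0 (hypothetical example, not the paper's setup)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw a predictor from the Gaussian centred at the minimum.
        h = [rng.gauss(0.0, sigma) for _ in curvatures]
        # Gradient of the quadratic loss: dL/dh_i = curvatures[i] * h_i.
        grad = [a * h_i for a, h_i in zip(curvatures, h)]
        total += sum(g * g for g in grad)
    return total / n_samples


# Same Gaussian spread around a flat minimum and around a sharp one.
flat = expected_sq_grad_norm([0.1, 0.1], sigma=0.5)
sharp = expected_sq_grad_norm([10.0, 10.0], sigma=0.5)
print(f"flat minimum:  {flat:.4f}")
print(f"sharp minimum: {sharp:.4f}")
```

For this toy loss the exact value is $\sigma^2 \sum_i a_i^2$, so the estimate stays small around the flat minimum and grows large around the sharp one, which is the qualitative behaviour a flatness term is meant to capture.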

Paper Structure (32 sections, 26 theorems, 81 equations, 1 figure)

Key Result

Proposition 3

Let $\mathrm{Q}= \mathcal{N}(\mu, \Sigma)$ be a Gaussian distribution on $\mathbb{R}^d$ with mean $\mu$ and covariance matrix $\Sigma$. Then the log-Sobolev and Poincaré inequalities hold for any $f \in \mathrm{H}^{1}(\mathrm{Q})$: the distribution $\mathrm{Q}$ is $\texttt{L-Sob}(c_{LS})$ with constant $c_{LS}(\mathrm{Q})=2\|\Sigma\|_{op}$ and is also $\texttt{Poinc}(c_{P})$ with constant $c_{P}(\mathrm{Q})=\|\Sigma\|_{op}$, wh
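The Poincaré part of this statement can be sanity-checked numerically. The sketch below assumes the standard form of the inequality, $\mathrm{Var}_{\mathrm{Q}}(f) \le c_{P}(\mathrm{Q})\, \mathbb{E}_{\mathrm{Q}}\|\nabla f\|^2$, and makes several illustrative choices that are not from the paper: a diagonal covariance (so the operator norm is the largest variance), a zero mean, the test function $f(x) = \sum_i \sin(x_i)$, and the helper name `poincare_check`.

```python
import math
import random


def poincare_check(variances, n_samples=20000, seed=0):
    """Monte Carlo check of Var_Q(f) <= ||Sigma||_op * E_Q ||grad f||^2
    for Q = N(0, diag(variances)) and f(x) = sum_i sin(x_i).
    Returns (estimated variance of f, estimated right-hand side)."""
    rng = random.Random(seed)
    stds = [math.sqrt(v) for v in variances]
    f_vals = []
    grad_energy = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, s) for s in stds]
        f_vals.append(sum(math.sin(x_i) for x_i in x))
        # grad f(x) = (cos(x_1), ..., cos(x_d)), so ||grad f||^2 = sum cos^2.
        grad_energy += sum(math.cos(x_i) ** 2 for x_i in x)
    grad_energy /= n_samples
    mean_f = sum(f_vals) / n_samples
    var_f = sum((v - mean_f) ** 2 for v in f_vals) / n_samples
    # For a diagonal covariance, the operator norm is the largest entry.
    c_p = max(variances)
    return var_f, c_p * grad_energy


var_f, bound = poincare_check([0.5, 2.0])
print(f"Var_Q(f) = {var_f:.3f}, c_P * E||grad f||^2 = {bound:.3f}")
```

A single test function cannot establish the inequality; the check only confirms that the stated constant is consistent with one example, where the variance sits well below the bound.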

Figures (1)

  • Figure 1: Evolution of the test risks (with the $0$-$1$ loss and the bounded cross-entropy loss) and of the value of $C$ during the training phase.

Theorems & Definitions (30)

  • Definition 1: Poincaré inequality
  • Definition 2: Log-Sobolev inequality
  • Proposition 3 (Gross, 1975; Brascamp and Lieb, 1976; Beckner, 1989)
  • Proposition 3
  • Example 1
  • Theorem 5
  • Corollary 5
  • Theorem 6
  • Corollary 7
  • Lemma 7
  • ...and 20 more