How good is PAC-Bayes at explaining generalisation?

Antoine Picard-Weibel, Eugenio Clerico, Roman Moscoviz, Benjamin Guedj

TL;DR

The paper investigates when PAC-Bayes bounds give meaningful generalisation guarantees, arguing that the tightest achievable bound depends solely on the distribution of the empirical risk induced by the prior, captured by the push-forward $\pi^{\#R}$. It shows that improving the bound requires the prior to allocate sufficient mass to low-risk predictors, and derives a quantile-based protocol to assess whether a prior is sufficient. Applying this to Catoni's bound, the authors obtain explicit forms for the minimum bound and the corresponding prior mass requirements, revealing the minimum prior mass that near-optimal predictors must receive for a tight guarantee in realistic settings. The work highlights fundamental limitations of interpreting PAC-Bayes as explaining generalisation in deep learning, especially with data-dependent priors, and argues for integrating additional theoretical principles or prior knowledge into the learning objective to obtain genuinely informative insights.

Abstract

We discuss necessary conditions for a PAC-Bayes bound to provide a meaningful generalisation guarantee. Our analysis reveals that the optimal generalisation guarantee depends solely on the distribution of the risk induced by the prior distribution. In particular, a target generalisation level is only achievable if the prior places sufficient mass on high-performing predictors. We relate these requirements to the prevalent practice of using data-dependent priors in deep learning PAC-Bayes applications, and discuss the implications for the claim that PAC-Bayes "explains" generalisation.

Paper Structure

This paper contains 22 sections, 9 theorems, 40 equations, 3 figures, 1 table.

Key Result

Theorem 1

Fix $D$ and $b$ satisfying assump:convexD. There is a map $B^{\textup{min}}_{D, b}:\Pi_{\mathbb{R}^+}\to\mathbb{R}$ such that, for any $\Gamma$, any $B\in \mathcal{B}_\Gamma$ of class $\mathcal{B}(D, b)$, any $R\in\mathcal{F}_\Gamma^+$, any $\pi_{\textup{p}}\in\Pi_\Gamma$, and any $\delta\in[0,1]$, Thus, the infimum of every $B$ of class $\mathcal{B}(D, b)$ is fully determined by the risk prior ...

Figures (3)

  • Figure 1: Comparison of $\overline{Q}_{\textup{Cat}, \lambda_{\textup{opt}}}$ and $\overline{Q}_{\textup{Cat}}$. The target generalisation gap is fixed to $G=0.015$, the number of observations to 60 000 and the confidence level to $1 - \delta = 1- 0.035$. The temperature-free requirement $\overline{Q}_{\textup{Cat}}$ exhibits a phase transition at $r_{\textup{thresh}} = G - 2\sqrt{-\log(\delta)/\left( 8n \right) } \simeq 0.009714$. While $\overline{Q}_{\textup{Cat},\lambda_{\textup{opt}}}$ provides a good approximation of the pointwise minimum for large values of $r$, it leads to more restrictive quantile requirements for $r<r_{\textup{thresh}}$. This graph implies that no prior putting less than 5.3e-11 mass on predictors with risk smaller than 0.01 can hope to obtain a generalisation guarantee tighter than 0.015, valid with a confidence level of 0.965, by training Catoni's bound on datasets with 60 000 samples (such as MNIST).
  • Figure 2: Evaluation of $\overline{Q}_{\textup{Cat},\lambda}$ as a function of $r$ for different temperatures. The target generalisation gap is fixed to $G=0.015$, the number of observations to 60 000 and the confidence level to $1 - \delta = 1- 0.035$. 40 temperatures between $\lambda_{\textup{min}}$ and $\lambda_{\textup{max}}$ are assessed (blue denotes lower temperatures, red higher temperatures). The minimum of the risk requirement over all temperatures is plotted in black.
  • Figure 3: Evaluation of $\pi_{\textup{p}}\left( R \leq 0.015 \right)$ in the uninformed prior setting for varying numbers of classes and clusters.
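
The phase-transition threshold quoted in the Figure 1 caption can be checked numerically. The sketch below only evaluates the caption's formula $r_{\textup{thresh}} = G - 2\sqrt{-\log(\delta)/(8n)}$ with the stated values; the function name `catoni_threshold` is a hypothetical helper introduced for illustration and is not taken from the paper.

```python
import math


def catoni_threshold(gap: float, n: int, delta: float) -> float:
    """Evaluate r_thresh = G - 2 * sqrt(-log(delta) / (8 n)).

    Hypothetical helper: the formula is copied from the Figure 1 caption,
    the name and signature are assumptions made for this sketch.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return gap - 2.0 * math.sqrt(-math.log(delta) / (8.0 * n))


# Values stated in the caption: G = 0.015, n = 60 000, delta = 0.035.
r_thresh = catoni_threshold(gap=0.015, n=60_000, delta=0.035)
print(f"r_thresh = {r_thresh:.6f}")
```

With these inputs the result agrees with the value of roughly 0.009714 given in the caption. Since the subtracted term shrinks as $n$ grows, the same formula moves the threshold closer to the target gap $G$ for larger datasets.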

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • proof
  • Corollary 2
  • proof
  • Theorem 3
  • Lemma 1
  • ...and 5 more