User-friendly introduction to PAC-Bayes bounds

Pierre Alquier

TL;DR

The paper provides a coherent, accessible introduction to PAC-Bayes bounds, reframing generalization as a probabilistic statement about distributions over predictors rather than about a single estimator. It introduces the Gibbs/posterior framework, derives Catoni's simple bound, and shows how bounds extend to aggregation, single draws, and expectations, including practical guidance on choosing hyperparameters and priors. It surveys a spectrum of tighter bounds (KL-based, Bernstein-based, and localization techniques), their applicability to deep learning, and the emergence of data-dependent priors for tighter certificates. The discussion further extends PAC-Bayes to unbounded losses, dependent data, and non-i.i.d. settings, and connects these ideas to related statistical and information-theoretic approaches, illustrating the versatility and practical potential of PAC-Bayes methods in modern ML contexts.

Abstract

Aggregated predictors are obtained by making a set of basic predictors vote according to some weights, that is, according to some probability distribution. Randomized predictors are obtained by sampling from a set of basic predictors, according to some prescribed probability distribution. Thus, aggregated and randomized predictors have in common that they are not defined by a minimization problem, but by a probability distribution on the set of predictors. In statistical learning theory, there is a set of tools designed to understand the generalization ability of such procedures: PAC-Bayesian or PAC-Bayes bounds. Since the original PAC-Bayes bounds of D. McAllester, these tools have been considerably improved in many directions (we will for example describe a simplified version of the localization technique of O. Catoni that was missed by the community, and later rediscovered as "mutual information bounds"). Very recently, PAC-Bayes bounds have received considerable attention: for example, there was a workshop on PAC-Bayes at NIPS 2017, "(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights", organized by B. Guedj, F. Bach and P. Germain. One of the reasons for this recent success is the successful application of these bounds to neural networks by G. Dziugaite and D. Roy. An elementary introduction to PAC-Bayes theory is still missing. This is an attempt to provide such an introduction.
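
To make the distinction in the abstract concrete, here is a minimal sketch, not taken from the paper, of an aggregated predictor and a randomized predictor built from the same probability distribution over basic predictors. The threshold classifiers, the weights, and the function names (make_threshold_predictor, aggregated_predict, randomized_predict) are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical basic predictors: threshold classifiers on a real-valued input.
def make_threshold_predictor(t):
    return lambda x: 1 if x >= t else 0

thresholds = [0.2, 0.5, 0.8]
predictors = [make_threshold_predictor(t) for t in thresholds]

# A probability distribution over the basic predictors (illustrative values).
weights = [0.5, 0.3, 0.2]

def aggregated_predict(x):
    """Aggregated predictor: the basic predictors vote according to the weights."""
    vote = sum(w * f(x) for w, f in zip(weights, predictors))
    return 1 if vote >= 0.5 else 0

def randomized_predict(x, rng):
    """Randomized predictor: draw one basic predictor from the distribution, then apply it."""
    f = rng.choices(predictors, weights=weights, k=1)[0]
    return f(x)

rng = random.Random(0)
print(aggregated_predict(0.6))       # deterministic weighted vote
print(randomized_predict(0.6, rng))  # depends on which predictor was drawn
```

In both cases the procedure is specified by the distribution (here, weights) rather than by a minimization problem, which is the common feature the abstract points to.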

Paper Structure

This paper contains 68 sections, 37 theorems, 299 equations.

Key Result

Proposition 1.1

For any $\theta\in\Theta$, for any $\delta\in(0,1)$,

Theorems & Definitions (74)

  • Proposition 1.1
  • Lemma 1.1: Hoeffding's inequality
  • proof: Proof of Proposition 1.1
  • Theorem 1.2
  • proof
  • Remark 1.1
  • Definition 1.1
  • Definition 1.2
  • Example 1.1
  • Proposition 1.2
  • ...and 64 more