Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification

Xiaohan Zhu; Mesrob I. Ohannessian; Nathan Srebro

Abstract

We consider a PAC-Bayes-type learning rule for binary classification that balances the training error of a randomized "posterior" predictor against its KL divergence to a pre-specified "prior". This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule to continuous priors and randomized predictions. With a balancing parameter of $\lambda=1$, this learning rule recovers an (empirical) Bayes posterior, and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead, a choice of $\lambda\gg 1$, which can be seen as using a sample-size-dependent prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter $\lambda$, identifying the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025], which considered only discrete priors, to PAC-Bayes-type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.
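To make the trade-off between training error and KL divergence concrete, here is a minimal sketch over a finite hypothesis class. It assumes a generic linear objective, $m\,\mathbb{E}_Q[\text{training error}] + \lambda\,\mathrm{KL}(Q\|\Pi)$, whose minimizer has the well-known Gibbs form $Q(h)\propto\Pi(h)\exp(-m\,\text{err}(h)/\lambda)$. This objective, the function names, and the toy numbers are illustrative assumptions; the paper's exact learning rule is not stated in this summary and may differ.

```python
import math

def gibbs_posterior(prior, train_errors, m, lam):
    """Minimizer of m * E_Q[err] + lam * KL(Q || prior) over distributions Q.

    Assumed generic objective (not necessarily the paper's rule).
    `prior` must be strictly positive and sum to 1.
    """
    # Unnormalized log-weights: log prior(h) - m * err(h) / lam.
    log_w = [math.log(p) - m * e / lam for p, e in zip(prior, train_errors)]
    top = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def objective(q, prior, train_errors, m, lam):
    """Value of the assumed objective at the distribution q."""
    expected_err = sum(qi * e for qi, e in zip(q, train_errors))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return m * expected_err + lam * kl

# Hypothetical toy problem: three hypotheses, m = 50 training examples.
prior = [0.5, 0.3, 0.2]
train_errors = [0.30, 0.10, 0.25]
m = 50
for lam in (1.0, 10.0, 1000.0):
    q = gibbs_posterior(prior, train_errors, m, lam)
    print(lam, [round(x, 3) for x in q])
```

In this sketch, a small $\lambda$ concentrates the posterior on the hypothesis with the lowest training error, while a large $\lambda$ keeps it close to the prior, which is the sense in which larger $\lambda$ regularizes more strongly.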

Paper Structure (21 sections, 22 theorems, 43 equations, 1 figure)

Introduction
Formal Setup
Unifying MDL and PAC-Bayes Limiting Errors
PAC-Bayes as Empirical Bayes
Full Bayesian Posterior and Profile Posterior
Limiting Behavior of Profile Posterior
Open Questions
Overview of PAC Bayes Upper and Lower Bounds Proofs
Upper Bound for $0 < \lambda < \infty$
Lower Bound for $0 < \lambda < \infty$
Summary
Upper Bounds for the PAC-Bayes Learning Rule
Lower Bounds for the PAC-Bayes Learning Rule
Lower Bound for $0 < \lambda < \infty$ (proof of \ref{MDL LB})
Proof of \ref{Q:LAMBDA0}
...and 6 more sections

Key Result

Theorem 3.1

(1) For any $0<\lambda\leq 1$, any source distribution $\mathcal{D}$, any distribution $Q^*$, any valid prior distribution $\Pi$, and any $m$:

(2) For any $\lambda > 1$, any source distribution $\mathcal{D}$, any distribution $Q^*$, any valid prior distribution $\Pi$, and any $m$:

where $O(\cdot)$ hides only an absolute constant that does not depend on $\mathcal{D}$, $\Pi$, or anything else.

Figures (1)

Figure 1: The function $\ell_\lambda(L^*)$ of equation \ref{ell}. When the PAC-Bayes rule is used with a fixed $\lambda$, the corresponding curve describes the best possible worst-case guarantee on the limiting error, across agnostic noise levels $L^*$. The curves are always above the diagonal (overfitting), but approach it as $\lambda\to \infty$, which is necessary for consistency (learning). For $\lambda\geq 1$, the overfitting is "tempered", meaning that the limiting error is less than $\frac{1}{2}$ (better than chance). For $\lambda<1$, this is only the case for $L^*<H^{-1}(\lambda)$, indicated by blue dots. For $\lambda=0$, the overfitting can be catastrophic, with worst-case limiting error always $1$. [Reproduced with permission from ZS.]

Theorems & Definitions (42)

Definition 2.1: Worst-case (uniform) limiting error
Definition 2.2: Worst-case (per-instance) limiting error
Theorem 3.1: Agnostic Upper Bound
Theorem 3.2: Agnostic Lower Bound
Corollary 3.2.1
Theorem 3.3
Theorem 3.4
Corollary 3.4.1
Theorem 3.5
Theorem 4.1
...and 32 more
