Better-than-KL PAC-Bayes Bounds

Ilja Kuzborskij; Kwang-Sung Jun; Yulian Wu; Kyoungseok Jang; Francesco Orabona

TL;DR

This work challenges the long-standing use of the KL divergence as the sole complexity measure in PAC-Bayes bounds by introducing the ZCP divergence, which jointly captures KL and total-variation terms. Through a coin-betting regret framework and Ville's inequality, the authors derive a high-probability PAC-Bayes bound with the ZCP divergence that is never worse than KL and can be strictly tighter in discrete and Gaussian-mixture regimes. They further show how this framework recovers and unifies several known bounds, including empirical Bernstein and Bernoulli KL-type bounds, while extending to fast-rate regimes. The results suggest the possibility of optimal-rate PAC-Bayes bounds and offer a new tool for tighter, data-dependent generalization guarantees with potential impact on learning theory and practice. The analysis hinges on a novel change-of-measure argument based on an $x^2/2$-type potential, which may be of independent interest for concentration theory.
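To make the two quantities named above concrete, here is a minimal Python sketch that computes the KL divergence and the total-variation distance between a discrete posterior and prior. It does not compute the ZCP divergence itself, whose definition is not reproduced in this summary. The function names and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


def total_variation(p, q):
    """Total-variation distance, i.e. half the L1 distance; always in [0, 1]."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


# Hypothetical example: a peaked posterior against a uniform prior.
posterior = [0.97, 0.01, 0.01, 0.01]
prior = [0.25, 0.25, 0.25, 0.25]

kl = kl_divergence(posterior, prior)
tv = total_variation(posterior, prior)
print(f"KL = {kl:.4f} nats, TV = {tv:.4f}")
```

Total variation never exceeds 1, whereas KL is unbounded, which gives some intuition for why a complexity term that also involves total variation can be smaller than one based on KL alone.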

Abstract

Let $f(\theta, X_1), \dots, f(\theta, X_n)$ be a sequence of random elements, where $f$ is a fixed scalar function, $X_1, \dots, X_n$ are independent random variables (data), and $\theta$ is a random parameter distributed according to some data-dependent posterior distribution $P_n$. In this paper, we consider the problem of proving concentration inequalities to estimate the mean of the sequence. An example of such a problem is the estimation of the generalization error of some predictor trained by a stochastic algorithm, such as a neural network where $f$ is a loss function. Classically, this problem is approached through a PAC-Bayes analysis where, in addition to the posterior, we choose a prior distribution which captures our belief about the inductive bias of the learning problem. Then, the key quantity in PAC-Bayes concentration bounds is a divergence that captures the complexity of the learning problem where the de facto standard choice is the KL divergence. However, the tightness of this choice has rarely been questioned. In this paper, we challenge the tightness of the KL-divergence-based bounds by showing that it is possible to achieve a strictly tighter bound. In particular, we demonstrate new high-probability PAC-Bayes bounds with a novel and better-than-KL divergence that is inspired by Zhang et al. (2022). Our proof is inspired by recent advances in regret analysis of gambling algorithms, and its use to derive concentration inequalities. Our result is first-of-its-kind in that existing PAC-Bayes bounds with non-KL divergences are not known to be strictly better than KL. Thus, we believe our work marks the first step towards identifying optimal rates of PAC-Bayes bounds.
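For orientation, one commonly cited KL-based bound of the kind the abstract refers to is the McAllester bound in Maurer's form. The statement below is a sketch under assumptions that the abstract does not spell out: $f$ takes values in $[0, 1]$, the data are i.i.d., and the prior, written here as $P_0$ (a symbol introduced for this illustration), does not depend on the data. Under those assumptions, with probability at least $1 - \delta$, simultaneously for all posteriors $P_n$:

$$
\mathbb{E}_{\theta \sim P_n}\big[\mathbb{E}\, f(\theta, X)\big]
\;\le\;
\mathbb{E}_{\theta \sim P_n}\Big[\frac{1}{n} \sum_{i=1}^{n} f(\theta, X_i)\Big]
+ \sqrt{\frac{\mathrm{KL}(P_n \,\|\, P_0) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
$$

The $\mathrm{KL}(P_n \,\|\, P_0)$ term is the complexity measure whose tightness the paper questions.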

Paper Structure (30 sections, 16 theorems, 89 equations)

Introduction
Our contributions
Additional related work
PAC-Bayes
Other divergences and connection to change-of-measure inequalities
Concentration from coin-betting
Definitions and preliminaries
Coin-betting game, regret, and Ville's inequality
The ZCP Divergence
Advantage over KL Divergence in Discrete Cases
Multivariate instances
Advantage over KL Divergence in the Mixture of Gaussian Case
Main results
Recovering variants of other known bounds with new divergence
Empirical Bernstein inequality
...and 15 more sections

Key Result

Theorem 1

Let $Z_1, \ldots, Z_n$ be a sequence of non-negative random variables such that $\mathbb{E}[Z_i \mid Z_1, \ldots, Z_{i-1}] = 0$. Let $M_t > 0$ be $\Sigma(Z_1, \dots, Z_{t-1})$-measurable such that $M_0 = 1$, and moreover assume that $\mathbb{E}[M_t \mid Z_1, \dots, Z_{t-1}] \leq M_{t-1}$. Then, for
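For reference, Ville's inequality in its standard form concludes that $\mathbb{P}(\max_t M_t \geq 1/\delta) \leq \delta$ for any $\delta \in (0, 1]$. The following is a minimal simulation sketch of that statement in a coin-betting setting. It assumes fair $\pm 1$ coins and a bettor who wagers a fixed fraction of current wealth each round, so the wealth process is a positive martingale starting at 1. The function names, the betting fraction, and the horizon are illustrative choices, not the paper's betting algorithm.

```python
import random


def max_wealth(n_rounds, bet_fraction, rng):
    """Peak wealth of a fixed-fraction bettor on fair +/-1 coins, starting from 1."""
    wealth = 1.0
    peak = wealth
    for _ in range(n_rounds):
        coin = rng.choice((-1.0, 1.0))        # zero-mean outcome
        wealth *= 1.0 + bet_fraction * coin   # stays positive when |bet_fraction| < 1
        peak = max(peak, wealth)
    return peak


def crossing_frequency(delta, n_rounds=200, bet_fraction=0.2, n_trials=2000, seed=0):
    """Fraction of simulated games in which wealth ever reaches 1/delta."""
    rng = random.Random(seed)
    threshold = 1.0 / delta
    hits = sum(
        max_wealth(n_rounds, bet_fraction, rng) >= threshold
        for _ in range(n_trials)
    )
    return hits / n_trials


if __name__ == "__main__":
    for delta in (0.5, 0.1, 0.05):
        freq = crossing_frequency(delta)
        print(f"delta={delta}: empirical crossing frequency {freq:.4f}")
```

Up to Monte Carlo error, each printed frequency should not exceed the corresponding `delta`. In the concentration-from-betting approach, a lower bound on the bettor's wealth combined with this inequality is what turns a regret guarantee into a high-probability statement.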

Theorems & Definitions (20)

Theorem 1: Ville's inequality (Ville, 1939)
Theorem 2
Lemma 3
Lemma 4
Proposition 5
Theorem 6: Hoeffding-type ZCP inequality
Remark 7
Theorem 8: Log-wealth ZCP inequality
Corollary 9
Remark 10
...and 10 more
