Recursive PAC-Bayes: A Frequentist Approach to Sequential Prior Updates with No Information Loss

Yi-Shan Wu; Yijie Zhang; Badr-Eddine Chérief-Abdellatif; Yevgeny Seldin

TL;DR

This work addresses a limitation of PAC-Bayes bounds: confidence information is lost when priors are updated with data. It introduces Recursive PAC-Bayes (RPB), a decomposition that expresses the expected loss as an excess loss relative to a downscaled prior loss plus the downscaled prior loss itself, which is bounded recursively. This enables sequential prior updates without information loss. The authors generalize the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables and derive a Recursive PAC-Bayes bound that combines these elements across multiple data splits, preserving information from all the data. Empirically, RPB on MNIST and Fashion-MNIST shows improved test performance and significantly tighter bounds as recursion depth increases, suggesting practical benefits for sequential learning and data-efficient prior design.

Abstract

PAC-Bayesian analysis is a frequentist framework for incorporating prior knowledge into learning. It was inspired by Bayesian learning, which allows sequential data processing and naturally turns posteriors from one processing step into priors for the next. However, despite two and a half decades of research, the ability to update priors sequentially without losing confidence information along the way remained elusive for PAC-Bayes. While PAC-Bayes allows construction of data-informed priors, the final confidence intervals depend only on the number of points that were not used for the construction of the prior, whereas confidence information in the prior, which is related to the number of points used to construct the prior, is lost. This limits the possibility and benefit of sequential prior updates, because the final bounds depend only on the size of the final batch. We present a novel and, in retrospect, surprisingly simple and powerful PAC-Bayesian procedure that allows sequential prior updates with no information loss. The procedure is based on a novel decomposition of the expected loss of randomized classifiers. The decomposition rewrites the loss of the posterior as an excess loss relative to a downscaled loss of the prior plus the downscaled loss of the prior, which is bounded recursively. As a side result, we also present a generalization of the split-kl and PAC-Bayes-split-kl inequalities to discrete random variables, which we use for bounding the excess losses, and which can be of independent interest. In empirical evaluation the new procedure significantly outperforms state-of-the-art.
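The decomposition described in words above can be written out using the notation of the figure captions ($\pi_t$ for the distribution at step $t$, $\gamma_t$ for the downscaling factor, $L(h)$ for the expected loss). The first line is the one-step identity, which holds simply because the subtracted term is added back. The second line is that identity applied twice for $T=3$, as a worked illustration rather than a restatement of the paper's theorem.

```latex
\begin{align*}
\mathbb{E}_{\pi_t}[L(h)]
  &= \underbrace{\mathbb{E}_{\pi_t}\bigl[L(h) - \gamma_t\,\mathbb{E}_{\pi_{t-1}}[L(h')]\bigr]}_{\text{excess loss}}
   + \gamma_t\,\underbrace{\mathbb{E}_{\pi_{t-1}}[L(h')]}_{\text{bounded recursively}}, \\[4pt]
\mathbb{E}_{\pi_3}[L(h)]
  &= \mathbb{E}_{\pi_3}\bigl[L(h) - \gamma_3\,\mathbb{E}_{\pi_2}[L(h')]\bigr]
   + \gamma_3\,\mathbb{E}_{\pi_2}\bigl[L(h) - \gamma_2\,\mathbb{E}_{\pi_1}[L(h')]\bigr]
   + \gamma_3\gamma_2\,\mathbb{E}_{\pi_1}[L(h)].
\end{align*}
```

Each of the three terms in the second line can be bounded on a different portion of the data, which is what allows the earliest term to use all of it.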

Paper Structure (34 sections, 5 theorems, 16 equations, 3 figures, 13 tables)

Introduction
The evolution of data-informed priors and the idea of Recursive PAC-Bayes
Uninformed priors
Data-informed priors
Data-informed priors + excess loss
Recursive PAC-Bayes (new)
Split-kl and PAC-Bayes-split-kl inequalities for discrete random variables
Split-kl inequality
PAC-Bayes-Split-kl inequality
Recursive PAC-Bayes bound
Discussion
Experiments
Details of the optimization and evaluation procedure
Convexification of the loss functions
Relaxation of the PAC-Bayes-kl bound
...and 19 more sections

Key Result

Theorem 1

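For any probability distribution $\pi$ on $\mathcal{H}$ that is independent of $S$ and any $\delta \in (0,1)$:
$$\mathbb{P}\left(\exists \rho \in \mathcal{P}:\ \mathrm{kl}\left(\mathbb{E}_\rho[\hat L(h,S)] \,\middle\|\, \mathbb{E}_\rho[L(h)]\right) \geq \frac{\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{n}/\delta)}{n}\right) \leq \delta,$$
where $\mathcal{P}$ is the set of all probability distributions on $\mathcal{H}$, including those dependent on $S$, $n$ is the number of points in $S$, and $\hat L(h,S)$ is the empirical loss of $h$ on $S$.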
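A bound of the PAC-Bayes-kl type is usually turned into an explicit upper bound on the expected loss by numerically inverting the binary kl divergence. The sketch below does this by bisection. It assumes the standard form of the inequality, with right-hand side $(\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{n}/\delta))/n$. The function names and example numbers are illustrative and are not taken from the paper.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Binary KL divergence kl(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)  # clamp q away from 0 and 1 to avoid log(0)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def kl_inverse_upper(p_hat: float, budget: float, iters: int = 100) -> float:
    """Approximate the largest q in [p_hat, 1] with kl(p_hat || q) <= budget.

    kl(p_hat || q) is increasing in q on [p_hat, 1], so bisection applies.
    The upper end of the final bracket is returned to stay on the safe side.
    """
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi


def pac_bayes_kl_bound(emp_loss: float, kl_rho_pi: float, n: int, delta: float) -> float:
    """Upper bound on the expected loss implied by the PAC-Bayes-kl inequality."""
    budget = (kl_rho_pi + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_loss, budget)


# Hypothetical numbers: empirical loss 0.1, KL(rho || pi) = 5, n = 10000, delta = 0.05.
bound = pac_bayes_kl_bound(0.1, 5.0, 10_000, 0.05)
print(f"upper bound on expected loss: {bound:.4f}")
```

With the other inputs fixed, a larger $n$ shrinks the right-hand side and tightens the bound. This is why the number of points available for evaluating each term matters.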

Figures (3)

Figure 1: Evolution of PAC-Bayes. The figure shows how data are used by different PAC-Bayes approaches. Dark yellow shows data used directly for optimization of the indicated quantities. Light yellow shows data involved indirectly through dependence on the prior. Light green shows data used for estimation of the indicated quantities. In Recursive PAC-Bayes data are released and used sequentially chunk-by-chunk, as indicated by the dashed lines. For example, in the $T=4$ case $\mathbb E_{\pi_1}[L(h)]$ is first evaluated on $S_1$ to construct $\pi_1$ and $\gamma_2$, then in the first recursion step on $S_1\cup S_2$, in the second step on $S_1\cup S_2\cup S_3$, and in the last step on all $S$.
Figure 2: Recursive Decomposition into Three Terms. The figure illustrates the recursive decomposition of $\mathbb E_{\pi_3}[L(h)]$ into three terms based on the Recursive PAC-Bayes decomposition (equation RPB), and a geometric data split, as used in our experiments. The bottom line illustrates which data are used for the construction of which distribution: $S_1$ for $\pi_1$; $S_2$ for $\pi_2$; and $S_3$ for $\pi_3$. The brackets above the data show which data are used for computing PAC-Bayes bounds for which term: $S_1\cup S_2\cup S_3$ for $\mathbb E_{\pi_1}[L(h)]$; $S_2\cup S_3$ for $\mathbb E_{\pi_2}[L(h)-\gamma_2\mathbb E_{\pi_1}[L(h')]]$; and $S_3$ for $\mathbb E_{\pi_3}[L(h) - \gamma_3\mathbb E_{\pi_2}[L(h')]]$. Note that a direct computation of a PAC-Bayes bound on $\mathbb E_{\pi_3}[L(h)]$ would only have allowed the use of the data in $S_3$, as shown by the black dashed line. The figure illustrates that the recursive decomposition makes more efficient use of the data. We also note that initially we start with poor priors, and so the $\mathrm{KL}(\pi_t\|\pi_{t-1})$ term for small $t$ is expected to be large, but this is compensated by a small multiplicative factor $\prod_{i=t+1}^T\gamma_i$ and the availability of a lot of data $\bigcup_{i=t}^T S_i$ for computing the PAC-Bayes bound. For example, $\mathbb E_{\pi_1}[L(h)]$ is multiplied by $\gamma_3\gamma_2$ and we can use all the data for computing a PAC-Bayes bound on this term. By the time we reach higher $t$, the priors $\pi_{t-1}$ get better, the $\mathrm{KL}(\pi_t\|\pi_{t-1})$ term in the bounds gets much smaller, and the bounds additionally benefit from the small variance of the excess loss. With a geometric split of the data, we use little data to quickly move $\pi_t$ to a good region, and then we still have enough data for a good estimation of the later terms, like $\mathbb E_{\pi_3}[L(h) - \gamma_3\mathbb E_{\pi_2}[L(h')]]$.
Figure 3: Decomposition of a discrete random variable into a superposition of binary random variables. The figure illustrates a decomposition of a discrete random variable $Z$ with a domain of four values $b_0 < b_1 < b_2 < b_3$ into a superposition of three binary random variables, $Z = b_0 + \sum_{j=1}^3 \alpha_j Z_{|j}$. A way to think about the decomposition is to compare it to a progress bar. In the illustration $Z$ takes value $b_2$, and so the random variables $Z_{|1}$ and $Z_{|2}$ corresponding to the first two segments "light up" (take value 1), whereas the random variable $Z_{|3}$ corresponding to the last segment remains "turned off" (takes value 0). The value of $Z$ equals $b_0$ plus the sum of the lengths $\alpha_j$ of the "lit" segments.
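The progress-bar picture in Figure 3 can be checked numerically. The sketch below reads the caption as $\alpha_j = b_j - b_{j-1}$ (segment lengths) and $Z_{|j} = \mathbb{1}[Z \ge b_j]$ (segment $j$ is lit when $Z$ has reached $b_j$). That reading, the function names, and the four example values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def binary_decomposition(z: float, values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Return segment lengths alpha_j and indicators Z_{|j} for one outcome z.

    `values` must be strictly increasing: b_0 < b_1 < ... < b_K.
    """
    if z not in values:
        raise ValueError("z must be one of the domain values")
    # alpha_j = b_j - b_{j-1}: length of the j-th segment of the "progress bar".
    alphas = [values[j] - values[j - 1] for j in range(1, len(values))]
    # Z_{|j} = 1 if z has reached b_j, else 0.
    indicators = [1 if z >= values[j] else 0 for j in range(1, len(values))]
    return alphas, indicators


def reconstruct(values: Sequence[float], alphas: Sequence[float], indicators: Sequence[int]) -> float:
    """Recover z as b_0 plus the total length of the lit segments."""
    return values[0] + sum(a * i for a, i in zip(alphas, indicators))


# Illustrative four-value domain b_0 < b_1 < b_2 < b_3; z takes the value b_2.
domain = [-0.5, 0.0, 0.5, 1.0]
alphas, indicators = binary_decomposition(0.5, domain)
print(indicators)                              # first two segments lit, last one off
print(reconstruct(domain, alphas, indicators))
```

Because each $Z_{|j}$ is binary, a kl-type inequality for Bernoulli variables can be applied to each segment separately and the results combined with the weights $\alpha_j$.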

Theorems & Definitions (8)

Theorem 1: PAC-Bayes-kl Inequality (See02, Mau04)
Theorem 2: kl Inequality (Lan05, FBBT21, FBB22)
Theorem 3: Split-kl Inequality for discrete random variables
proof
Theorem 4: PAC-Bayes-Split-kl Inequality
proof
Theorem 5: Recursive PAC-Bayes Bound
proof
