Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks

Diego García-Pérez, Emilio Parrado-Hernández, John Shawe-Taylor

TL;DR

The paper advances PAC-Bayes risk certificates for neural networks by deriving two new explicit bounds (TRP and RTS) that tighten existing guarantees and facilitate gradient-based optimization of risk certificates. It introduces an approach based on implicit differentiation and a KL-modulating technique to align gradients when optimizing bounds, including for non-differentiable losses like the 0-1 loss. Empirical validation on MNIST and CIFAR-10 demonstrates non-vacuous generalization bounds for CIFAR-10 with shallower networks and shows improved certifiability when optimizing for the bound directly. The work provides practical algorithms and code to reproduce experiments, highlighting both the potential and current limitations of PAC-Bayes bounds in explaining deep learning success, and suggesting future architecture- and data-aware directions.

Abstract

This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks. Code to reproduce all experiments is available at github.com/Diegogpcm/pacbayesgradients.
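
The abstract mentions implicit differentiation as the route to optimizing a risk certificate inside the training objective. One standard way to obtain such gradients, not necessarily the paper's exact formulation, is to treat the certificate $p^* = \mathrm{kl}^{-1}(q, K)$ as the solution of $\mathrm{kl}(q \,\|\, p^*) = K$ and apply the implicit function theorem. The sketch below assumes this setup; the function names are hypothetical and are not taken from the paper's repository.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), for 0 < q, p < 1."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


def kl_inverse(q: float, k: float, iters: int = 200) -> float:
    """Largest p in [q, 1) with kl(q || p) <= k, located by bisection."""
    lo, hi = q, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= k:
            lo = mid  # mid still satisfies the constraint, move up
        else:
            hi = mid  # mid violates the constraint, move down
    return lo


def kl_inverse_with_grads(q: float, k: float):
    """Return p* = kl^{-1}(q, k) with dp*/dq and dp*/dk.

    Assumes 0 < q < p* < 1. Differentiating kl(q || p*) = k implicitly gives
    dp*/dk = 1 / (d kl / d p) and dp*/dq = -(d kl / d q) / (d kl / d p).
    """
    p = kl_inverse(q, k)
    dkl_dp = (p - q) / (p * (1.0 - p))
    dkl_dq = math.log(q * (1.0 - p) / (p * (1.0 - q)))
    return p, -dkl_dq / dkl_dp, 1.0 / dkl_dp


if __name__ == "__main__":
    p, dp_dq, dp_dk = kl_inverse_with_grads(0.1, 0.05)
    print(f"certificate={p:.4f}  d/dq={dp_dq:.4f}  d/dK={dp_dk:.4f}")
```

The bisection itself is never differentiated: only the solution $p^*$ enters the closed-form gradients, which is what makes this kind of scheme cheap to plug into a gradient-based training loop.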

Paper Structure

This paper contains 18 sections, 5 theorems, 26 equations, 5 figures, 4 tables.

Key Result

Theorem 2.1

Let $S=\{(X_i,Y_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} Z$ with $n \geq 8$. For any distribution $Q_0$ independent of $S$, any distribution $Q$, and any bounded loss function $0 \leq l \leq 1$, it holds with probability at least $1-\delta$ that:
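
The displayed inequality is not reproduced in this summary. Assuming Theorem 2.1 is the standard PAC-Bayes-kl bound in Maurer's form, which the condition $n \geq 8$ and the theorem's name suggest, it would read, simultaneously for all $Q$:

$$\mathrm{kl}\big(\hat{L}_S(Q) \,\|\, L(Q)\big) \leq \frac{\mathrm{KL}(Q \,\|\, Q_0) + \log\frac{2\sqrt{n}}{\delta}}{n},$$

where $\mathrm{kl}(q \,\|\, p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$ is the KL divergence between Bernoulli distributions with parameters $q$ and $p$, $\hat{L}_S(Q)$ is the empirical risk, and $L(Q)$ is the true risk of the posterior $Q$. The figure captions denote the right-hand side of this inequality by $K$.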

Figures (5)

  • Figure 1: Different upper bounds on the true risk of the posterior over the hypotheses, $L(Q)$, as a function of $K$, the right-hand side of the PAC-Bayes-kl bound, for a small (top) and a large (bottom) empirical risk $\hat{L}_S(Q)$. The dark green curve represents Maurer's bound, of which the other curves are relaxations, so the closer a curve lies to the dark green one, the tighter the relaxation.
  • Figure 2: Map of the tightest bound (out of all the bounds studied in the paper) for different values of $K$ (the right-hand side of the PAC-Bayes-kl bound) and $\hat{L}_S(Q)$.
  • Figure 3: Risk certificates on cross-entropy loss (top) and zero-one loss (bottom) on different experiment runs. Color indicates training objective. The training objective functions work with $L^{xe}$ in both plots, which explains why the RTS bound outperforms the stronger bypassed Maurer's bound in the bottom plot.
  • Figure 4: Observed values of $\hat{L}_S^{01}$ plotted against the value of $\hat{L}_S^{xe}$ for the same experiment run. An approximately linear relationship is strongly implied.
  • Figure 5: Risk certificates computed for each MLP (one MLP per objective function and explored length scale of the prior $Q_0$) using the zero-one loss. The training objective functions work with $L^{01}$.
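
Figures 1 and 2 compare explicit relaxations against the tightest certificate, obtained by inverting the binary kl divergence, and show that the best relaxation depends on the empirical risk. A minimal sketch of that kind of comparison follows. Because the TRP and RTS formulas are not given in this summary, it uses two classical relaxations as stand-ins: Pinsker's inequality and the refined Pinsker inequality. The function names are hypothetical.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), natural logarithm."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_inverse_upper(q: float, k: float, iters: int = 200) -> float:
    """Tightest certificate: largest p in [q, 1] with kl(q || p) <= k."""
    lo, hi = q, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= k:
            lo = mid
        else:
            hi = mid
    return lo


def pinsker_upper(q: float, k: float) -> float:
    """Relaxation from kl(q || p) >= 2 (p - q)^2."""
    return min(1.0, q + math.sqrt(k / 2.0))


def refined_pinsker_upper(q: float, k: float) -> float:
    """Relaxation from kl(q || p) >= (p - q)^2 / (2 p), valid for q <= p."""
    return min(1.0, q + k + math.sqrt(k * k + 2.0 * q * k))


if __name__ == "__main__":
    k = 0.05  # illustrative value of the right-hand side K
    for q in (0.01, 0.4):  # a small and a large empirical risk
        print(
            f"q={q:.2f}  kl-inverse={kl_inverse_upper(q, k):.4f}  "
            f"pinsker={pinsker_upper(q, k):.4f}  "
            f"refined={refined_pinsker_upper(q, k):.4f}"
        )
```

With these stand-ins, the refined Pinsker relaxation is the tighter one at small empirical risk and plain Pinsker is tighter at large empirical risk, which is the kind of regime dependence that Figure 2 maps out for the bounds studied in the paper.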

Theorems & Definitions (9)

  • Theorem 2.1: PAC-Bayes-kl
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Theorem: TRP inequality
  • proof
  • Theorem: RTS
  • proof