Tighter Generalisation Bounds via Interpolation

Paul Viallard; Maxime Haddouche; Umut Şimşekli; Benjamin Guedj

TL;DR

This work develops a unifying PAC-Bayes framework based on $(f,\Gamma)$-divergences to tighten generalisation bounds by interpolating between $f$-divergences and IPMs such as Wasserstein. It introduces two generic bound templates and then instantiates them to KL–Wasserstein interpolation, as well as bounds beyond KL (reverse KL, Hellinger, TV), including applications to heavy-tailed SGD. A key contribution is showing how these interpolations connect to Rademacher complexity and how they yield tractable, practically usable training objectives. The experimental study demonstrates that jointly optimising a posterior and an intermediate distribution can improve generalisation on several datasets, particularly when a Dirac posterior is used. Overall, the paper provides a flexible, theory-guided approach to tighter generalisation bounds with practical learning algorithms.
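For orientation, the interpolation mentioned above can be written out as follows. This is a sketch in the standard form used in the $(f,\Gamma)$-divergence literature, not a quotation from the paper: regularity conditions on $f$ and $\Gamma$ are omitted, $f^*$ denotes the convex conjugate of $f$, and the symbols $\eta$, $\nu$ and $W^\Gamma$ are notation assumed here.

```latex
% (f, Gamma)-divergence between a posterior rho and a prior pi on H
D_f^{\Gamma}(\rho \,\|\, \pi)
  = \sup_{g \in \Gamma}
    \Big\{ \mathbb{E}_{h\sim\rho}\, g(h)
           - \inf_{\nu \in \mathbb{R}}
             \big\{ \nu + \mathbb{E}_{h\sim\pi}\, f^*\big(g(h) - \nu\big) \big\}
    \Big\}

% Gamma-IPM (e.g. 1-Wasserstein when Gamma is the set of 1-Lipschitz functions)
W^{\Gamma}(\rho, \eta)
  = \sup_{g \in \Gamma}
    \Big\{ \mathbb{E}_{h\sim\rho}\, g(h) - \mathbb{E}_{h\sim\eta}\, g(h) \Big\}

% Infimal-convolution form: eta is the intermediate distribution
D_f^{\Gamma}(\rho \,\|\, \pi)
  = \inf_{\eta \in \mathcal{P}(\mathcal{H})}
    \Big\{ W^{\Gamma}(\rho, \eta) + D_f(\eta \,\|\, \pi) \Big\}
```

Choosing $\eta=\rho$ in the last line gives $D_f^{\Gamma}(\rho\|\pi)\le D_f(\rho\|\pi)$, and choosing $\eta=\pi$ gives $D_f^{\Gamma}(\rho\|\pi)\le W^{\Gamma}(\rho,\pi)$, which is the sense in which the divergence interpolates between an $f$-divergence and an IPM.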

Abstract

This paper contains a recipe for deriving new PAC-Bayes generalisation bounds based on the $(f, \Gamma)$-divergence and, in addition, presents PAC-Bayes generalisation bounds that interpolate between a series of probability divergences (including but not limited to KL, Wasserstein, and total variation), making the best out of many worlds depending on the posterior distribution's properties. We explore the tightness of these bounds and connect them to earlier results from statistical learning, which are specific cases. We also instantiate our bounds as training objectives, yielding non-trivial guarantees and practical performances.
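To make "depending on the posterior distribution's properties" concrete, here is a minimal sketch of one well-known case: for a Dirac posterior whose support the prior does not cover, the KL divergence is infinite, whereas total variation and the 1-Wasserstein distance stay finite. The toy distributions, the helper names (`kl_divergence`, `total_variation`, `wasserstein_1d`) and the one-dimensional setting are assumptions made for illustration; this is not the paper's experimental setup.

```python
import math

def kl_divergence(rho, pi):
    """KL(rho || pi) for discrete distributions given as {point: mass} dicts."""
    total = 0.0
    for x, p in rho.items():
        if p == 0.0:
            continue
        q = pi.get(x, 0.0)
        if q == 0.0:
            # rho puts mass where pi has none, so KL is infinite.
            return math.inf
        total += p * math.log(p / q)
    return total

def total_variation(rho, pi):
    """Total variation distance: half the L1 distance between the masses."""
    support = set(rho) | set(pi)
    return 0.5 * sum(abs(rho.get(x, 0.0) - pi.get(x, 0.0)) for x in support)

def wasserstein_1d(rho, pi):
    """1-Wasserstein distance on the real line: integral of |CDF_rho - CDF_pi|."""
    points = sorted(set(rho) | set(pi))
    cdf_rho = cdf_pi = 0.0
    dist = 0.0
    for left, right in zip(points, points[1:]):
        cdf_rho += rho.get(left, 0.0)
        cdf_pi += pi.get(left, 0.0)
        dist += abs(cdf_rho - cdf_pi) * (right - left)
    return dist

# Hypothetical toy example: uniform prior on {1, 2, 3}, Dirac posterior at 0.
prior = {1.0: 1 / 3, 2.0: 1 / 3, 3.0: 1 / 3}
dirac = {0.0: 1.0}

kl = kl_divergence(dirac, prior)
tv = total_variation(dirac, prior)
w1 = wasserstein_1d(dirac, prior)
print(f"KL = {kl}, TV = {tv:.3f}, W1 = {w1:.3f}")
```

In this toy case a complexity term based only on KL would be vacuous, while one that can fall back on a Wasserstein or total-variation term remains finite.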

Paper Structure (31 sections, 22 theorems, 148 equations, 70 tables)

Introduction
Notation and Background
Elementary Steps Towards Generalisation
Generalisation Bounds with Various Complexity Measures
A Fundamental Example: a PAC-Bayes Bound Interpolating KL Divergence and Wasserstein
PAC-Bayes Bounds Beyond KL Divergence
Novel Connections in Statistical Learning
A Rigorous Link Between PAC-Bayesian and Rademacher-based Bounds
Generalisation Bounds for Heavy-tailed SDEs
Experimental Study
A Novel Learning Algorithm
Experiments
Conclusion
Supplementary Background on $\alpha$-stable Lévy Processes
Supplementary Results
...and 16 more sections

Key Result

Theorem 3.1

Let $\phi_\mathcal{S}\in\Gamma$, $\delta\in[0,1]$ and $\pi\in \mathcal{P}(\mathcal{H})$. With probability at least $1-\delta$ over $\mathcal{S}\sim\mathcal{D}^m$, we have for all $\rho\in\mathcal{P}(\mathcal{H})$

Theorems & Definitions (41)

Definition 2.1
Definition 2.2: $(f, \Gamma)$-divergence
Theorem 3.1
Theorem 3.2
Theorem 4.1
Theorem 4.2
Theorem 5.1
Corollary 5.1
Corollary 5.1
Theorem 6.1
...and 31 more
