An Information-Theoretic Approach to Generalization Theory

Borja Rodríguez-Gálvez; Ragnar Thobaben; Mikael Skoglund

TL;DR

This monograph develops an information-theoretic framework for understanding in-distribution generalization in machine learning. It contrasts guarantees in expectation (via mutual information, Wasserstein distances, and conditional variants) with PAC-Bayesian high-probability bounds, and links privacy notions to generalization. It gives a structured treatment of bounds based on information measures, geometry-aware metrics, and truncation/moment interpolation, and applies them to noisy iterative algorithms such as SGLD and SGD. It also discusses the limitations of these bounds, including scenarios where mutual-information-based bounds are provably loose, and situates the framework within the broader generalization literature. Overall, the work offers an algorithm- and data-aware view of generalization with practical implications for private and iterative learning methods.

Abstract

We investigate the in-distribution generalization of machine learning algorithms. We depart from traditional complexity-based approaches by analyzing information-theoretic bounds that quantify the dependence between a learning algorithm and the training data. We consider two categories of generalization guarantees: 1) Guarantees in expectation: These bounds measure performance in the average case. Here, the dependence between the algorithm and the data is often captured by information measures. While these measures offer an intuitive interpretation, they overlook the geometry of the algorithm's hypothesis class. Here, we introduce bounds using the Wasserstein distance to incorporate geometry, and a structured, systematic method to derive bounds capturing the dependence between the algorithm and an individual datum, and between the algorithm and subsets of the training data. 2) PAC-Bayesian guarantees: These bounds measure the performance level with high probability. Here, the dependence between the algorithm and the data is often measured by the relative entropy. We establish connections between the Seeger--Langford and Catoni's bounds, revealing that the former is optimized by the Gibbs posterior. We introduce novel, tighter bounds for various types of loss functions. To achieve this, we introduce a new technique to optimize parameters in probabilistic statements. To study the limitations of these approaches, we present a counter-example where most of the information-theoretic bounds fail while traditional approaches do not. Finally, we explore the relationship between privacy and generalization. We show that algorithms with a bounded maximal leakage generalize. For discrete data, we derive new bounds for differentially private algorithms that guarantee generalization even with a constant privacy parameter, which is in contrast to previous bounds in the literature.
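
To make the two categories of guarantees concrete, the sketch below evaluates one classical bound of each kind numerically. It is not code from the monograph, and the function names (`mi_generalization_bound`, `binary_kl`, `seeger_langford_bound`) are hypothetical. It assumes the standard mutual-information bound in expectation for a $\sigma$-sub-Gaussian loss, $|\mathbb{E}[\mathrm{gen}]| \leq \sqrt{2\sigma^2 I(W;S)/n}$, and the Seeger--Langford bound in Maurer's form for losses in $[0,1]$, $\mathrm{kl}(\hat{R} \,\|\, R) \leq (\mathrm{KL}(Q\|P) + \log(2\sqrt{n}/\delta))/n$ for $n \geq 8$. The tighter bounds developed in the monograph itself are not reproduced here.

```python
import math


def mi_generalization_bound(mutual_info_nats: float, n: int, sigma: float) -> float:
    """Bound in expectation: sqrt(2 * sigma^2 * I(W;S) / n), with I(W;S) in nats."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if mutual_info_nats < 0 or sigma < 0:
        raise ValueError("mutual information and sigma must be non-negative")
    return math.sqrt(2.0 * sigma ** 2 * mutual_info_nats / n)


def binary_kl(q: float, p: float) -> float:
    """Relative entropy kl(q || p) between Bernoulli(q) and Bernoulli(p), in nats."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:  # convention: 0 * log(0 / b) = 0
            if b <= 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total


def seeger_langford_bound(emp_risk: float, kl_posterior_prior: float,
                          n: int, delta: float) -> float:
    """Upper bound on the population risk obtained by inverting the binary kl.

    Finds (by bisection) the largest p >= emp_risk such that
    kl(emp_risk || p) <= (KL(Q||P) + log(2 * sqrt(n) / delta)) / n.
    Assumes a loss taking values in [0, 1].
    """
    if not 0.0 <= emp_risk <= 1.0:
        raise ValueError("emp_risk must lie in [0, 1]")
    if n < 8 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 8 and delta in (0, 1)")
    rhs = (kl_posterior_prior + math.log(2.0 * math.sqrt(n) / delta)) / n
    lo, hi = emp_risk, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if binary_kl(emp_risk, mid) <= rhs:
            lo = mid  # mid still satisfies the constraint, move right
        else:
            hi = mid  # mid violates the constraint, move left
    return hi  # return the upper end so the result stays a valid bound
```

For example, `mi_generalization_bound(0.5, 1000, 0.5)` evaluates $\sqrt{0.00025} \approx 0.0158$; the value $\sigma = 1/2$ corresponds to a loss bounded in $[0,1]$. The first bound holds on average over the training set, whereas `seeger_langford_bound` returns a risk level that holds with probability at least $1-\delta$ over the draw of the training set.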

Paper Structure (165 sections, 127 theorems, 522 equations, 15 figures, 1 table)

Introduction
Background
Overview of the Monograph
Guarantees in Expectation
PAC-Bayesian Guarantees
Connection Between Privacy and Generalization
Preliminaries
Probability Theory
Probability Spaces and Random Objects
Polish and Standard Borel Spaces
Conditional Probability and Conditional Expectation
Several Random Objects
Densities
Moments and Generating Functions
Information Theory
...and 150 more sections

Key Result

Theorem 2.1

Let $(\mathcal{X},\rho)$ be a Polish space. If $\mathcal{S}$ is a subset of $\mathcal{X}$, then $(\mathcal{S}, \mathcal{B}(\mathcal{S}))$ is a standard Borel space.

Figures (15)

Figure 1.1: Fictitious relationship between two biological markers and the presence (orange crosses) or absence (blue dots) of a disease. The marks without a box represent the training set available to the algorithm, and the ones with a box represent new, unseen data. On the left, there is a complex hypothesis that overfits the training set. On the right, there is a simple hypothesis that performs slightly worse on the training set but generalizes to new instances.
Figure 2.1: Illustration of a channel processing $X$ to $Y$.
Figure 2.2: Illustrations of a single random object $X$ processed with different channels (left) and two different random objects processed with the same channel (right).
Figure 3.1: Illustration of a learning algorithm $\mathbb{A}$ viewed as a channel processing a dataset $S$ to obtain a hypothesis $H$.
Figure 4.1: Illustration of a learning algorithm $\mathbb{A}$ viewed as a channel processing a dataset $S$ to obtain a hypothesis $W$ (left), and of the backward channel describing the processing of a hypothesis $W$ to obtain the dataset with which it was trained (right).
...and 10 more figures

Theorems & Definitions (178)

Definition 2.1
Definition 2.2
Definition 2.3
Definition 2.4
Theorem 2.1: Polish and standard Borel spaces [gray2009probability]
Definition 2.5
Theorem 2.2: Radon--Nikodym theorem [mcdonald1999course]
Definition 2.6
Definition 2.7
Definition 2.8
...and 168 more
