An Information Theoretic Perspective on Conformal Prediction

Alvaro H. C. Correia; Fabio Valerio Massoli; Christos Louizos; Arash Behboodi

TL;DR

This work links conformal prediction (CP) to information theory by upper bounding the intrinsic uncertainty $H(Y|X)$ in three ways: a bound based on the data processing inequality (DPI) and two Fano-type bounds (simple and model-based). The bounds serve as differentiable training objectives that enable end-to-end training of classifiers from scratch and steer CP toward smaller prediction sets; they also provide a principled way to incorporate side information. Experiments in centralized and federated settings show that the proposed bounds yield smaller average prediction sets than competing methods, and that side information consistently improves efficiency.
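
For orientation, the classical textbook form of Fano's inequality is recalled here; it is not one of the paper's bounds, which are stated for prediction sets rather than point estimates. The symbols $\hat{Y}$, $P_e$, $h_b$ and $\mathcal{Y}$ are notation assumed for this sketch: $\hat{Y}$ is any estimate of $Y$ computed from $X$, $P_e = P(\hat{Y} \neq Y)$ is its error probability, $h_b$ is the binary entropy function, and $\mathcal{Y}$ is the finite label space.

$$H(Y|X) \;\le\; H(Y|\hat{Y}) \;\le\; h_b(P_e) + P_e \log\left(|\mathcal{Y}| - 1\right)$$

The first inequality follows from the data processing inequality applied to the Markov chain $Y \to X \to \hat{Y}$. The second is Fano's inequality: knowing whether $\hat{Y}$ is wrong costs at most $h_b(P_e)$, and when it is wrong, $Y$ can still take at most $|\mathcal{Y}| - 1$ values.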

Abstract

Conformal Prediction (CP) is a distribution-free uncertainty estimation framework that constructs prediction sets guaranteed to contain the true answer with a user-specified probability. Intuitively, the size of the prediction set encodes a general notion of uncertainty, with larger sets associated with higher degrees of uncertainty. In this work, we leverage information theory to connect conformal prediction to other notions of uncertainty. More precisely, we prove three different ways to upper bound the intrinsic uncertainty, as described by the conditional entropy of the target variable given the inputs, by combining CP with information-theoretic inequalities. Moreover, we demonstrate two direct and useful applications of this connection between conformal prediction and information theory: (i) more principled and effective conformal training objectives that generalize previous approaches and enable end-to-end training of machine learning models from scratch, and (ii) a natural mechanism to incorporate side information into conformal prediction. We empirically validate both applications in centralized and federated learning settings, showing that our theoretical results translate to lower inefficiency (average prediction set size) for popular CP methods.

Paper Structure

This paper contains 46 sections, 14 theorems, 87 equations, 2 figures, 18 tables, 1 algorithm.

Introduction
Background
Conformal Prediction
Conformal Prediction as List Decoding
Information Theory Applied to Conformal Prediction
Data Processing Inequality for Conformal Prediction
Model-Based Fano's Inequality and Variations
Conformal Training
Side Information
The Distributed Learning Setting
Related Work
Experiments
Conformal Training
Side Information
Federated Learning (FL)
...and 31 more sections

Key Result

Theorem 2.1

If $\{(X_i, Y_i)\}_{i=1}^n$ are i.i.d. (or only exchangeable), then for a new i.i.d. draw $(X_{test}, Y_{test})$, any $\alpha \in (0,1)$, and any score function $s$ such that $\{S_i\}_{i=1}^n$ are almost surely distinct, ${\mathcal{C}}(X_{test})$ as defined above satisfies

$$1 - \alpha \;\le\; P\left(Y_{test} \in {\mathcal{C}}(X_{test})\right) \;\le\; 1 - \alpha + \frac{1}{n+1}.$$
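
The coverage guarantee of split conformal prediction can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `conformal_quantile` and `simulate`, the synthetic softmax "model", and the score $s(x, y) = 1 - p_y(x)$ are all assumptions chosen for illustration. Labels are sampled from the model's own probabilities, so calibration and test points are i.i.d. and the scores are almost surely distinct.

```python
import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: every label must be included.
        return np.inf
    return float(np.sort(cal_scores)[k - 1])


def simulate(n_cal=500, n_test=2000, n_classes=10, alpha=0.1, seed=0):
    """Empirical coverage and average set size on synthetic data (illustrative)."""
    rng = np.random.default_rng(seed)
    n = n_cal + n_test

    # Hypothetical model: softmax over random logits.
    logits = 2.0 * rng.normal(size=(n, n_classes))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Sample each label from its own row of probabilities (inverse-CDF sampling).
    u = rng.random(n)
    labels = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
    labels = np.minimum(labels, n_classes - 1)  # guard against rounding in cumsum

    # Non-conformity score of the true label: 1 - p_y(x).
    scores = 1.0 - probs[np.arange(n), labels]
    q_hat = conformal_quantile(scores[:n_cal], alpha)

    # Prediction set: all labels whose score does not exceed the threshold.
    sets = (1.0 - probs[n_cal:]) <= q_hat
    coverage = sets[np.arange(n_test), labels[n_cal:]].mean()
    avg_size = sets.sum(axis=1).mean()
    return float(coverage), float(avg_size)


if __name__ == "__main__":
    coverage, avg_size = simulate()
    print(f"empirical coverage: {coverage:.3f}, average set size: {avg_size:.2f}")
```

The guarantee is marginal over the draw of the calibration set, so for a single seed the empirical coverage fluctuates around $1 - \alpha$ rather than matching it exactly. The average set size returned alongside it is the inefficiency measure that the paper's training objectives aim to reduce.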

Figures (2)

Figure 1: Graphical model of SCP. ${\mathcal{D}}_{cal}$ is a calibration set, ${\mathcal{C}}(X)$ the prediction set, $\hat{Y}=f(X)$ the model prediction, and $E$ the event $\{Y \in {\mathcal{C}}(X)\}$. Square and round nodes are, respectively, deterministic and stochastic functions of their parents.
Figure 2: Expected $[\log|{\mathcal{C}}(X)|]^+$ as a function of $\alpha$.

Theorems & Definitions (35)

Theorem 2.1: Vovk et al. (2005); Lei et al. (2018)
Proposition 3.1: DPI Bound
Proposition 3.2: Model-Based Fano Bound
Corollary 3.1: Simple Fano Bound
Definition B.1: Exchangeable Random Variables
Definition B.2: Rank
Remark B.3
Lemma B.4
Corollary B.1
proof
...and 25 more
