The information bottleneck method
Naftali Tishby, Fernando C. Pereira, William Bialek
TL;DR
The paper introduces the Information Bottleneck (IB) method, a principled way to extract relevant information by compressing a signal $X$ through a bottleneck variable $\tilde{X}$ while preserving as much information about a relevance variable $Y$ as possible. It formulates a variational principle in which the optimal encoding minimizes the compression cost $I(X;\tilde{X})$ while maintaining the relevant information $I(\tilde{X};Y)$, leading to self-consistent equations whose solutions can be found by a convergent IB algorithm. Unlike standard rate-distortion theory, the distortion measure is not chosen in advance but emerges from the data, given by $D_{KL}(p(y|x)\,\|\,p(y|\tilde{x}))$, and the approach unifies prediction, filtering, and learning under a single information-theoretic framework. Annealing in the tradeoff parameter yields a family of hierarchical representations that balance compression against predictive power. Practical implications include new methods for semantic clustering, document classification, neural coding, and related signal-processing tasks where relevance is paramount.
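As a concrete illustration of the self-consistent equations, here is a minimal NumPy sketch that iterates the three IB updates on a discrete joint distribution $p(x,y)$. The function name, random initialization, and fixed iteration count are illustrative assumptions, not details from the paper; in practice one would check convergence and anneal $\beta$.

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    Alternates the three updates from the paper:
      p(t|x) ∝ p(t) exp(-beta * D_KL[p(y|x) || p(y|t)])
      p(t)   = sum_x p(x) p(t|x)
      p(y|t) = (1/p(t)) sum_x p(t|x) p(x) p(y|x)
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)   # conditional p(y|x)

    # Random soft initialization of the encoder p(t|x).
    p_t_given_x = rng.random((p_xy.shape[0], n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                                    # cluster marginal p(t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= (p_t[:, None] + eps)                        # decoder p(y|t)

        # Emergent distortion d(x, t) = D_KL[p(y|x) || p(y|t)] for every (x, t) pair.
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
        d = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)

        # Encoder update: p(t|x) ∝ p(t) exp(-beta * d(x, t)).
        logits = np.log(p_t + eps)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)                # numerical stability
        p_t_given_x = np.exp(logits)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    return p_t_given_x, p_t, p_y_given_t
```

Each pass is one step of the generalized Blahut-Arimoto re-estimation mentioned in the abstract: holding two distributions fixed while updating the third is what makes every step a coordinate-wise improvement and the iteration convergent.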
Abstract
We define the relevant information in a signal $x \in X$ as being the information that this signal provides about another signal $y \in Y$. Examples include the information that face images provide about the names of the people portrayed, or the information that speech sounds provide about the words spoken. Understanding the signal $x$ requires more than just predicting $y$; it also requires specifying which features of $X$ play a role in the prediction. We formalize this problem as that of finding a short code for $X$ that preserves the maximum information about $Y$. That is, we squeeze the information that $X$ provides about $Y$ through a 'bottleneck' formed by a limited set of codewords $\tilde{X}$. This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure $d(x,\tilde{x})$ emerges from the joint statistics of $X$ and $Y$. This approach yields an exact set of self-consistent equations for the coding rules $X \to \tilde{X}$ and $\tilde{X} \to Y$. Solutions to these equations can be found by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm. Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
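To see the compression/relevance tradeoff the abstract describes, one can reuse the `information_bottleneck` sketch above on a toy joint distribution and compare the compression cost $I(X;\tilde{X})$ against the preserved relevant information $I(\tilde{X};Y)$, which is bounded above by $I(X;Y)$. The toy numbers, helper function, and $\beta$ value below are assumptions chosen only for illustration.

```python
import numpy as np

def mutual_information(p_joint, eps=1e-12):
    """I(A;B) in nats for a discrete joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    ratio = p_joint / (p_a @ p_b + eps)
    return float((p_joint * np.log(ratio + eps)).sum())

# Toy joint p(x, y): four signals x carrying noisy information about two labels y.
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.10],
                 [0.05, 0.15]])

p_t_given_x, p_t, p_y_given_t = information_bottleneck(p_xy, n_clusters=2, beta=5.0)

p_x = p_xy.sum(axis=1)
p_xt = p_t_given_x * p_x[:, None]          # joint p(x, t)
p_ty = p_y_given_t * p_t[:, None]          # joint p(t, y)

print("compression  I(X;T):", mutual_information(p_xt))
print("relevance    I(T;Y):", mutual_information(p_ty))
print("upper bound  I(X;Y):", mutual_information(p_xy))
```

Sweeping $\beta$ from small to large traces out the rate-relevance curve: small $\beta$ favors aggressive compression (low $I(X;\tilde{X})$), while large $\beta$ pushes $I(\tilde{X};Y)$ toward its ceiling $I(X;Y)$.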
