The information bottleneck method
Naftali Tishby, Fernando C. Pereira, William Bialek
TL;DR
The paper introduces the Information Bottleneck (IB) method, a principled way to extract relevant information by compressing a signal $X$ through a bottleneck variable $\tilde{X}$ while preserving as much information about a relevance variable $Y$ as possible. It formulates a variational principle in which the optimal encoding minimizes the compression cost $I(X;\tilde{X})$ while maintaining the relevant information $I(\tilde{X};Y)$, leading to self-consistent equations whose solutions can be found by a convergent IB algorithm. Unlike standard rate-distortion theory, the distortion measure is not chosen in advance but emerges from the data, given by $D_{KL}(p(y|x)\,\|\,p(y|\tilde{x}))$, and the approach unifies prediction, filtering, and learning under a single information-theoretic framework. Annealing in the tradeoff parameter yields a family of hierarchical representations that balance compression against predictive power. Practical implications include new methods for semantic clustering, document classification, neural coding, and related signal-processing tasks where relevance is paramount.
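As a concrete illustration of the self-consistent equations, here is a minimal NumPy sketch that iterates the three IB updates on a discrete joint distribution $p(x,y)$. The function name, random initialization, and fixed iteration count are illustrative assumptions, not details from the paper; in practice one would check convergence and anneal $\beta$.

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0, eps=1e-12):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    Alternates the three updates from the paper:
      p(t|x) ∝ p(t) exp(-beta * D_KL[p(y|x) || p(y|t)])
      p(t)   = sum_x p(x) p(t|x)
      p(y|t) = (1/p(t)) sum_x p(t|x) p(x) p(y|x)
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)   # conditional p(y|x)

    # Random soft initialization of the encoder p(t|x).
    p_t_given_x = rng.random((p_xy.shape[0], n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                                    # cluster marginal p(t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= (p_t[:, None] + eps)                        # decoder p(y|t)

        # Emergent distortion d(x, t) = D_KL[p(y|x) || p(y|t)] for every (x, t) pair.
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
        d = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)

        # Encoder update: p(t|x) ∝ p(t) exp(-beta * d(x, t)).
        logits = np.log(p_t + eps)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)                # numerical stability
        p_t_given_x = np.exp(logits)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    return p_t_given_x, p_t, p_y_given_t
```

Each pass is one step of the generalized Blahut-Arimoto re-estimation mentioned in the abstract: holding two distributions fixed while updating the third is what makes every step a coordinate-wise improvement and the iteration convergent.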
Abstract
We define the relevant information in a signal $x \in X$ as being the information that this signal provides about another signal $y \in Y$. Examples include the information that face images provide about the names of the people portrayed, or the information that speech sounds provide about the words spoken. Understanding the signal $x$ requires more than just predicting $y$; it also requires specifying which features of $X$ play a role in the prediction. We formalize this problem as that of finding a short code for $X$ that preserves the maximum information about $Y$. That is, we squeeze the information that $X$ provides about $Y$ through a 'bottleneck' formed by a limited set of codewords $\tilde{X}$. This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure $d(x,\tilde{x})$ emerges from the joint statistics of $X$ and $Y$. This approach yields an exact set of self-consistent equations for the coding rules $X \to \tilde{X}$ and $\tilde{X} \to Y$. Solutions to these equations can be found by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm. Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
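To see the compression/relevance tradeoff the abstract describes, one can reuse the `information_bottleneck` sketch above on a toy joint distribution and compare the compression cost $I(X;\tilde{X})$ against the preserved relevant information $I(\tilde{X};Y)$, which is bounded above by $I(X;Y)$. The toy numbers, helper function, and $\beta$ value below are assumptions chosen only for illustration.

```python
import numpy as np

def mutual_information(p_joint, eps=1e-12):
    """I(A;B) in nats for a discrete joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    ratio = p_joint / (p_a @ p_b + eps)
    return float((p_joint * np.log(ratio + eps)).sum())

# Toy joint p(x, y): four signals x carrying noisy information about two labels y.
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.10],
                 [0.05, 0.15]])

p_t_given_x, p_t, p_y_given_t = information_bottleneck(p_xy, n_clusters=2, beta=5.0)

p_x = p_xy.sum(axis=1)
p_xt = p_t_given_x * p_x[:, None]          # joint p(x, t)
p_ty = p_y_given_t * p_t[:, None]          # joint p(t, y)

print("compression  I(X;T):", mutual_information(p_xt))
print("relevance    I(T;Y):", mutual_information(p_ty))
print("upper bound  I(X;Y):", mutual_information(p_xy))
```

Sweeping $\beta$ from small to large traces out the rate-relevance curve: small $\beta$ favors aggressive compression (low $I(X;\tilde{X})$), while large $\beta$ pushes $I(\tilde{X};Y)$ toward its ceiling $I(X;Y)$.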
