A Generalized Information Bottleneck Theory of Deep Learning

Charles Westphal, Stephen Hailes, Mirco Musolesi

TL;DR

The paper tackles theoretical ambiguities in the Information Bottleneck (IB) framework by proposing a Generalized Information Bottleneck (GIB) that foregrounds synergy among input features. It introduces a PMI-based reweighting and a feature-wise synergy decomposition using interaction information, and proves that, under perfect estimation, the classical IB objective is upper bounded by the GIB objective, while also resolving issues such as infinite compression. Empirically, GIB yields consistent compression phases across activations and architectures, and its complexity term aligns with adversarial robustness, offering interpretable learning dynamics in CNNs and Transformers. The work presents evidence that synergistic feature processing improves generalization and provides a practical, more complete lens for analyzing deep learning representations, with potential implications for robust and transferable models.

Abstract

The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a Generalized Information Bottleneck (GIB) framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations, we reformulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with ReLU activations, where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.
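
To make the notion of synergy in the abstract concrete, the following is a minimal sketch of interaction information for two binary features and a target, computed exactly by enumerating a uniform input distribution. It is not the paper's estimator: the helper names, the two-feature setting, and the sign convention (II = I(X1;Y|X2) - I(X1;Y), so that positive values indicate synergy) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def conditional_mutual_information(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(list(zip(a, c))) + entropy(list(zip(b, c)))
            - entropy(list(zip(a, b, c))) - entropy(c))


def interaction_information(x1, x2, y):
    """Assumed convention: II = I(X1;Y|X2) - I(X1;Y); positive means synergy."""
    return conditional_mutual_information(x1, y, x2) - mutual_information(x1, y)


# Enumerate all four equally likely inputs (hypothetical toy setting).
pairs = list(product([0, 1], repeat=2))
x1 = [a for a, _ in pairs]
x2 = [b for _, b in pairs]

xor_y = [a ^ b for a, b in pairs]   # needs both features jointly
copy_y = [a for a, _ in pairs]      # needs only the first feature

ii_xor = interaction_information(x1, x2, xor_y)
ii_copy = interaction_information(x1, x2, copy_y)
print(f"XOR:  II = {ii_xor:.3f} bits")
print(f"copy: II = {ii_copy:.3f} bits")
```

Under this convention, XOR carries one full bit that neither feature reveals on its own, whereas copying a single feature carries none. The abstract's definition averages such a quantity over each feature against the remaining ones, which this two-feature sketch does not attempt.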

Paper Structure

This paper contains 65 sections, 3 theorems, 16 equations, 16 figures, 1 table.

Key Result

Theorem 1

If we assume perfect training accuracy and therefore $Q(Z,Y) = Z = Y$, then the original IB objective is upper bounded by our GIB:

Figures (16)

  • Figure 1: This schematic illustrates information plane dynamics during training, with trajectories color-coded from early epochs (light colors) to late epochs (dark purple), showing distinct fitting and compression phases.
  • Figure 2: Information plane dynamics across multiple activation functions, extending Shwartz-Ziv and Tishby (2017) and Saxe et al. (2018) beyond tanh and ReLU to include softplus, swish, and leaky ReLU. Standard IB (blue) shows compression only for tanh; GIB (pink) shows compression for all activation functions. Each column represents one seed.
  • Figure 3: Synergistic processing of noise enhances generalization. (a) Controlled synthetic experiment demonstrating how synergy affects information flow (see Appendix \ref{app:synthetic} for details). Three functions of increasing synergy process binary inputs with noise: non-synergistic (blue), partially synergistic (green), and highly synergistic (magenta). Left: $I(f(X,\varepsilon); \varepsilon)$; more synergistic functions depend less on the noise. Right: $I(f(X,\varepsilon); X)$; more synergistic functions have lower MI with the input. (b) Empirical validation on CIFAR-10 using ResNets of varying depths (see Appendix \ref{app:cifar_synergy}). We quantify synergistic interactions between inputs and noise as $I(f(X,\varepsilon); \varepsilon|X) / I(f(X,\varepsilon); X,\varepsilon)$. Higher synergy correlates with smaller generalization gaps.
  • Figure 4: Information plane dynamics for NNs learning simple mathematical functions. Comparison of standard IB versus GIB across five functions (rows) and five random seeds. Functions include basic arithmetic and symmetric polynomials. GIB consistently shows compression phases (leftward movement), while standard IB exhibits varied behaviors. See Appendix \ref{app:simple_functions} for experimental details.
  • Figure 5: Information plane dynamics for ResNets of varying depths trained on CIFAR-10. Comparison across four network depths and five random seeds. GIB consistently exhibits compression phases, while standard IB shows limited or no compression. See Appendix \ref{app:resnets} for details.
  • ...and 11 more figures
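
The synergy ratio quoted in the Figure 3 caption, $I(f(X,\varepsilon); \varepsilon|X) / I(f(X,\varepsilon); X,\varepsilon)$, can be evaluated exactly for small discrete functions. The sketch below does so for three hypothetical toy functions of a binary input and a binary noise bit, both assumed uniform and independent. These are not the functions used in the paper's synthetic experiment, and all helper names are illustrative.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mi(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def cmi(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(list(zip(a, c))) + entropy(list(zip(b, c)))
            - entropy(list(zip(a, b, c))) - entropy(c))


def synergy_ratio(f, xs, es):
    """I(f; eps | X) / I(f; X, eps) for outputs f(x, eps) over enumerated inputs."""
    out = [f(x, e) for x, e in zip(xs, es)]
    total = mi(out, list(zip(xs, es)))
    return cmi(out, es, xs) / total if total > 0 else 0.0


# Uniform, independent binary input and noise (an assumption of this sketch).
grid = list(product([0, 1], repeat=2))
xs = [x for x, _ in grid]
es = [e for _, e in grid]

# Hypothetical toy functions, not taken from the paper.
toy_functions = {
    "copy input": lambda x, e: x,       # ignores the noise entirely
    "AND": lambda x, e: x & e,          # partly determined by x alone
    "XOR": lambda x, e: x ^ e,          # only the joint (x, eps) is informative
}

ratios = {name: synergy_ratio(f, xs, es) for name, f in toy_functions.items()}
for name, value in ratios.items():
    print(f"{name:>10}: synergy ratio = {value:.3f}")
```

By the chain rule, $I(f; X,\varepsilon) = I(f; X) + I(f; \varepsilon|X)$, so the ratio is the fraction of the output's information that cannot be recovered from $X$ alone. It is 0 for the copy function and 1 for XOR, with AND falling in between.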

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof