Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression

Ivan Butakov; Alexander Tolmachev; Sofia Malanchuk; Anna Neopryatnaya; Alexey Frolov; Kirill Andreev

TL;DR

This work tackles the challenge of applying Information Bottleneck analysis to deep networks by introducing a compression-based mutual information estimator that operates on latent representations. It leverages stochastic neural network dynamics and explicit lossy compression (via autoencoders and PCA) to estimate $I(X;L)$ and $I(Y;L)$, supported by two-sided entropy bounds that quantify the deviation introduced by compression. The authors provide a synthetic data framework to benchmark MI estimators, compare several entropy estimators, and demonstrate the approach on an MNIST convolutional classifier, revealing multiple layer-wise fitting/compression phases and linking the first compression phase to rapid loss reduction. The method offers a practical IB diagnostic tool for real networks and suggests avenues for improved regularization and architecture search, while pointing to future enhancements such as invertible normalizing flows for lossless compression.

Abstract

The Information Bottleneck (IB) principle offers an information-theoretic framework for analyzing the training process of deep neural networks (DNNs). Its essence lies in tracking the dynamics of two mutual information (MI) values: between the hidden layer output and the DNN input/target. According to the hypothesis put forth by Shwartz-Ziv & Tishby (2017), the training process consists of two distinct phases: fitting and compression. The latter phase is believed to account for the good generalization performance exhibited by DNNs. Due to the challenging nature of estimating MI between high-dimensional random vectors, this hypothesis was only partially verified for NNs of tiny sizes or specific types, such as quantized NNs. In this paper, we introduce a framework for conducting IB analysis of general NNs. Our approach leverages the stochastic NN method proposed by Goldfeld et al. (2019) and incorporates a compression step to overcome the obstacles associated with high dimensionality. In other words, we estimate the MI between the compressed representations of high-dimensional random vectors. The proposed method is supported by both theoretical and practical justifications. Notably, we demonstrate the accuracy of our estimator through synthetic experiments featuring predefined MI values and comparison with MINE (Belghazi et al., 2018). Finally, we perform IB analysis on a close-to-real-scale convolutional DNN, which reveals new features of the MI dynamics.
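The idea of estimating MI between compressed representations can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes PCA as the compressor, a plain (non-weighted) Kozachenko-Leonenko k-NN entropy estimator, and the identity $I(X;Y) = h(X) + h(Y) - h(X,Y)$ applied to the compressed vectors. It also omits the stochastic-NN noise injection. All function names (`kl_entropy`, `pca_compress`, `compressed_mi`, `toy_data`) are hypothetical.

```python
import math
import numpy as np

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -np.euler_gamma + np.sum(1.0 / np.arange(1, n))

def kl_entropy(z, k=5):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    n, d = z.shape
    # Brute-force pairwise Euclidean distances; fine for ~1e3 samples.
    dist = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    rho = np.partition(dist, k - 1, axis=1)[:, k - 1]  # k-th neighbour distance
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * np.mean(np.log(rho))

def pca_compress(x, dim):
    """Project centred data onto its first `dim` principal components."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def compressed_mi(x, y, dim_x, dim_y, k=5):
    """MI estimate between compressed x and y via three entropy estimates."""
    cx, cy = pca_compress(x, dim_x), pca_compress(y, dim_y)
    joint = np.hstack([cx, cy])
    return kl_entropy(cx, k) + kl_entropy(cy, k) - kl_entropy(joint, k)

def toy_data(rho, n_samples=1000, ambient_dim=64, seed=0):
    """Correlated scalar Gaussians embedded linearly into a high-dimensional space."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n_samples)
    b = rho * a + math.sqrt(1.0 - rho**2) * rng.standard_normal(n_samples)
    x = np.outer(a, rng.standard_normal(ambient_dim))
    y = np.outer(b, rng.standard_normal(ambient_dim))
    return x, y
```

For the toy data the true value is $-\tfrac{1}{2}\log(1-\rho^2)$, because the linear embedding is injective and PCA to one component recovers the latent scalar up to sign and scale. Calling `compressed_mi(*toy_data(0.9), 1, 1)` should land near that value, up to the estimator's finite-sample error.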

Paper Structure (22 sections, 4 theorems, 46 equations, 6 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries
Mutual information estimation via compression
Mutual information estimation
Bounds for mutual information estimate
Synthetic dataset generation
Comparison of the entropy estimators
Information flow in deep neural networks
Discussion
Complete proofs
Entropy bounds
Limitations of previous works
Limitations of other estimators
Classical entropy estimators
Kernel density estimation
...and 7 more sections

Key Result

Corollary 1

Let $\xi_1$, $\xi_2$, $\eta_1$, $\eta_2$ be random variables such that the components of each of the following tuples are independent: $(\xi_1, \xi_2)$, $(\eta_1, \eta_2)$ and $\left((\xi_1, \eta_1), (\xi_2, \eta_2) \right)$. Then $I \left((\xi_1, \xi_2); (\eta_1, \eta_2) \right) = I(\xi_1; \eta_1) + I(\xi_2; \eta_2)$.
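The additivity in Corollary 1 can be checked numerically in a special case. The sketch below assumes jointly Gaussian variables with unit variances, which is only an illustration (the corollary itself is not restricted to Gaussians). It uses the closed form $I(X;Y) = \tfrac{1}{2}\left(\log\det\Sigma_X + \log\det\Sigma_Y - \log\det\Sigma\right)$. The helper names are hypothetical.

```python
import numpy as np

def gaussian_mi(cov, dim_x):
    """MI (nats) between the first dim_x coordinates and the rest of a Gaussian vector."""
    cov = np.asarray(cov, dtype=float)
    _, logdet_x = np.linalg.slogdet(cov[:dim_x, :dim_x])
    _, logdet_y = np.linalg.slogdet(cov[dim_x:, dim_x:])
    _, logdet_joint = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_joint)

def pair_cov(r):
    """Covariance of a single pair (xi, eta) with correlation r."""
    return np.array([[1.0, r], [r, 1.0]])

def joint_cov(r1, r2):
    """Covariance of (xi1, xi2, eta1, eta2): pair 1 and pair 2 are independent."""
    cov = np.eye(4)
    cov[0, 2] = cov[2, 0] = r1  # corr(xi1, eta1)
    cov[1, 3] = cov[3, 1] = r2  # corr(xi2, eta2)
    return cov

r1, r2 = 0.6, 0.9
lhs = gaussian_mi(joint_cov(r1, r2), dim_x=2)  # I((xi1, xi2); (eta1, eta2))
rhs = gaussian_mi(pair_cov(r1), 1) + gaussian_mi(pair_cov(r2), 1)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Both sides reduce to $-\tfrac{1}{2}\log\left((1-r_1^2)(1-r_2^2)\right)$, since the joint covariance is a permuted block-diagonal matrix.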

Figures (6)

Figure 1: Conceptual scheme of Statement \ref{statement:information_bounds_with_denoising} as applied to lossy compression with autoencoder $A = D \circ E$.
Figure 2: Conceptual scheme of Algorithm \ref{alg:main_algo}. In order to observe and quantify the loss of information caused by the compression step, we split $f \colon \mathbb{R}^{n'} \to \mathbb{R}^{n}$ into two functions: $f_1 \colon \mathbb{R}^{n'} \to \mathbb{R}^{n'}$ maps $\xi$ to a structured latent representation of $X$ (e.g., parameters of geometric shapes), and $f_2 \colon \mathbb{R}^{n'} \to \mathbb{R}^n$ maps latent representations to the corresponding high-dimensional vectors (e.g., rasterized images of geometric shapes). The same goes for $g = g_2 \circ g_1$. Colors correspond to those in Figures \ref{fig:compare_methods_gauss} and \ref{fig:compare_methods_rectangles}. For a proper experimental setup, we require $f_1, f_2, g_1, g_2$ to satisfy the conditions of Statement \ref{statement:MI_under_nonsingular_mappings}.
Figure 3: Comparison of maximum-likelihood and least-squares-error KDE, non-weighted and weighted Kozachenko-Leonenko, and MINE estimators for $16 \times 16$ (first row) and $32 \times 32$ (second row) images of 2D Gaussians ($n' = m' = 2$), $5 \cdot 10^3$ samples. The $x$ axes show $I(X;Y)$; the $y$ axes show $\hat{I}(X;Y)$.
Figure 4: Comparison of maximum-likelihood and least-squares-error KDE, non-weighted and weighted Kozachenko-Leonenko, and MINE estimators for $16 \times 16$ (first row) and $32 \times 32$ (second row) images of rectangles ($n' = m' = 4$), $5 \cdot 10^3$ samples. The $x$ axes show $I(X;Y)$; the $y$ axes show $\hat{I}(X;Y)$.
Figure 5: Information plane plots for the MNIST classifier. The lower left parts of the plots (b)-(d) correspond to the first epochs. We use 95% asymptotic CIs for the MI estimates acquired from the compressed data. The colormap represents the difference of losses between two consecutive epochs.
...and 1 more figure
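The synthetic setup described in the Figure 2 caption (latent vectors with a predefined MI, pushed through $f = f_2 \circ f_1$ and $g = g_2 \circ g_1$ to obtain images) might be sketched as follows. Everything here is an assumption made for illustration, not the paper's generator: the latents are component-wise correlated Gaussians, a sigmoid plays the role of $f_1$ and $g_1$, and a Gaussian-blob rasterizer plays the role of $f_2$ and $g_2$. The maps are assumed to be injective enough to preserve MI, as the caption requires of $f_1, f_2, g_1, g_2$. All function names and the `width` parameter are hypothetical.

```python
import math
import numpy as np

def rho_for_mi(mi_per_component):
    """Correlation giving the requested MI (nats) for a bivariate Gaussian pair."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi_per_component))

def sample_latents(total_mi, latent_dim, n_samples, seed=0):
    """Independent Gaussian pairs; by additivity the per-pair MIs sum to total_mi."""
    rng = np.random.default_rng(seed)
    rho = rho_for_mi(total_mi / latent_dim)
    xi = rng.standard_normal((n_samples, latent_dim))
    noise = rng.standard_normal((n_samples, latent_dim))
    eta = rho * xi + math.sqrt(1.0 - rho**2) * noise
    return xi, eta

def to_centers(z, size):
    """Stand-in for f_1 / g_1: sigmoid, a smooth injective map onto pixel coordinates."""
    return (size - 1) / (1.0 + np.exp(-z))

def rasterize_blobs(centers, size=16, width=2.0):
    """Stand-in for f_2 / g_2: render one Gaussian blob per 2D centre."""
    grid = np.arange(size, dtype=float)
    cx = centers[:, 0][:, None, None]
    cy = centers[:, 1][:, None, None]
    xx = grid[None, None, :]
    yy = grid[None, :, None]
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * width**2))

def make_dataset(total_mi, n_samples, size=16, seed=0):
    """Pairs of blob images whose MI equals total_mi under the stated assumptions."""
    xi, eta = sample_latents(total_mi, latent_dim=2, n_samples=n_samples, seed=seed)
    x = rasterize_blobs(to_centers(xi, size), size)
    y = rasterize_blobs(to_centers(eta, size), size)
    return x, y
```

A call such as `make_dataset(1.5, 5000)` returns two arrays of shape `(5000, 16, 16)`. The target value 1.5 nats is then the reference against which an MI estimator can be compared.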

Theorems & Definitions (15)

Remark 1
Definition 1
Definition 2
proof
proof
Corollary 1
proof
proof
proof
Corollary 2
...and 5 more
