Estimating Information Flow in Deep Neural Networks

Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, Yury Polyanskiy

TL;DR

This work challenges the Information Bottleneck interpretation of deep learning by showing that the mutual information $I(X;T)$ is vacuous in deterministic networks: it is constant for discrete inputs and infinite for continuous ones. It proposes a noisy DNN framework, with Gaussian noise injected at each layer, in which $I(X;T)$ is meaningful, and develops a rigorous estimator based on forward-pass sampling and the differential entropy of Gaussian mixtures. The authors show that the compression observed during training arises from progressive clustering of the hidden representations of same-class inputs, not from saturation alone, and that binning-based MI estimates in deterministic networks mostly track this clustering. Across SZT models and an MNIST CNN, clustering emerges as the core geometric phenomenon guiding representation learning, while regularization that limits clustering can mitigate compression and influence generalization. The results advocate shifting focus from MI values to clustering dynamics as a more informative and actionable lens for understanding and regularizing deep representations.

Abstract

We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information $I(X;T)$ between the input $X$ and internal representations $T$ decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true $I(X;T)$ over these networks is provably either constant (discrete $X$) or infinite (continuous $X$). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which $I(X;T)$ is a meaningful quantity that depends on the network's parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models. By relating $I(X;T)$ in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the $T$ space. Finally, we return to the estimator of $I(X;T)$ employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.
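
The role of clustering in the noisy framework can be illustrated on a single noisy layer. The following is a minimal sketch, not the authors' estimator: it assumes $T = S + Z$ with $S = f(X)$ deterministic, $X$ uniform over a finite dataset, and $Z \sim \mathcal{N}(0, \beta^2 \mathrm{I}_d)$, so that $I(X;T) = h(T) - h(Z)$ and $h(T)$ is the differential entropy of a Gaussian mixture centered at the values of $S$. The function names, the Monte Carlo sample size, and the toy inputs are hypothetical.

```python
import numpy as np


def gaussian_mixture_entropy(centers, beta, n_mc=2000, rng=None):
    """Monte Carlo estimate (nats) of h(S + Z), with S uniform over the rows
    of `centers` and Z ~ N(0, beta^2 I). Illustrative helper, not from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = centers.shape
    # Draw samples from the mixture: pick a center, then add Gaussian noise.
    idx = rng.integers(0, n, size=n_mc)
    samples = centers[idx] + beta * rng.standard_normal((n_mc, d))
    # Log-density of each sample under every mixture component.
    sq_dist = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    log_kernel = -sq_dist / (2 * beta**2) - 0.5 * d * np.log(2 * np.pi * beta**2)
    # Numerically stable log of the average over components.
    m = log_kernel.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) - np.log(n)
    return float(-log_p.mean())


def noisy_layer_mi(centers, beta, n_mc=2000, rng=None):
    """I(X;T) = h(T) - h(Z) for one noisy layer under the assumptions above."""
    d = centers.shape[1]
    h_noise = 0.5 * d * np.log(2 * np.pi * np.e * beta**2)
    return gaussian_mixture_entropy(centers, beta, n_mc, rng) - h_noise


rng = np.random.default_rng(0)
beta = 0.1
# Eight well-separated representations (corners of a cube) vs. one tight cluster.
spread = np.array([[i, j, k] for i in (-1.0, 1.0)
                   for j in (-1.0, 1.0) for k in (-1.0, 1.0)])
clustered = np.zeros((8, 3))

mi_spread = noisy_layer_mi(spread, beta, rng=rng)
mi_clustered = noisy_layer_mi(clustered, beta, rng=rng)
print("separated:", mi_spread, "clustered:", mi_clustered)
```

In this toy setting the separated representations give an estimate close to $\ln 8 \approx 2.08$ nats, while the collapsed ones give an estimate close to zero, up to Monte Carlo error. This matches the qualitative claim that clustering of hidden representations lowers $I(X;T)$.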

Paper Structure

This paper contains 28 sections, 7 theorems, 44 equations, 14 figures, 1 table.

Key Result

Theorem 1

Fix $\ell\in[L-1]$ and assume $\|f_\ell\|_\infty\leq 1$. For $\hat{I}_\mathsf{SP}$ from EQ:MI_estimator_final with $n=n_x$, for all $x\in\mathcal{X}$, we have where $c$ is a numerical constant explicitly given in Appendix SUPPSUBSEC:risk_bounds.

Figures (14)

  • Figure 1: $I(X;\mathsf{Bin}(T_\ell))$ vs. epochs for different bin sizes and the model in DNNs_Tishby2017, where $X$ is uniformly distributed over a $2^{12}$-sized empirical dataset. The curves converge to $H(X)=\ln(2^{12})\approx 8.3$ for small bins.
  • Figure 2: $k$th noisy neuron in layer $\ell$: $\mathrm{W}_\ell^{(k)}$ and $b_\ell(k)$ are the $k$th row/entry of the weight matrix and the bias, respectively.
  • Figure 3: Cosine similarity histograms between internal representations of deterministic, noisy, and dropout MNIST CNNs.
  • Figure 4: Single-layer tanh network: (a) the density $p_{T(k)}$ at epochs $k=250,2500$; (b) $p_{T(k)}$ and (c) $I(X; T(k))$ as a function of $k$; and (d) mutual information as a function of $k$, for different $\beta$ values.
  • Figure 5: (a) Evolution of $I(X;T_\ell)$ and training/test losses across training epochs for the SZT model with $\beta = 0.005$ and tanh nonlinearities. The scatter plots show the values of Layer 5 ($d_5 = 3$) at the arrow-marked epochs on the mutual information plot. The bottom plot shows $H(\mathsf{Bin}(T_\ell))$ across epochs for bin size $B = 10\beta$. (b) Same setup as in (a) but with regularization that encourages orthonormal weight matrices. (c) SZT model with $\beta = 0.01$ and linear activations.
  • ...and 9 more figures
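
The small-bin limit described in the Figure 1 caption can be reproduced on a toy deterministic layer. The sketch below assumes $X$ is uniform over all $2^{12}$ binary vectors of length 12 and passes it through a hypothetical 3-unit tanh layer with random weights (not the trained model from DNNs_Tishby2017). Because $T$ is a deterministic function of $X$, $I(X;\mathsf{Bin}(T)) = H(\mathsf{Bin}(T))$, which reaches $H(X) = 12\ln 2 \approx 8.32$ nats once the bins are fine enough to separate every input.

```python
import numpy as np


def binned_entropy(t, bin_size):
    """Entropy (nats) of Bin(T) when T is uniform over the rows of `t`."""
    # Map every representation to the index of the grid cell containing it.
    cells = np.floor(t / bin_size).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
n_bits = 12
# All 2^12 binary input vectors (assumed dataset for this illustration).
x = ((np.arange(2**n_bits)[:, None] >> np.arange(n_bits)) & 1).astype(float)
# Hypothetical deterministic tanh layer with random weights and no bias.
w = rng.standard_normal((n_bits, 3)) / np.sqrt(n_bits)
t = np.tanh(x @ w)

for bin_size in (1.0, 0.1, 1e-2, 1e-6):
    print(f"bin size {bin_size:g}: H(Bin(T)) = {binned_entropy(t, bin_size):.3f} nats")
print(f"H(X) = ln(2^12) = {n_bits * np.log(2):.3f} nats")
```

The binned value depends on the bin size rather than on any information-theoretic property of the layer: coarse bins merge nearby representations and report a small entropy, while sufficiently fine bins always recover $H(X)$.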

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2: Theorem 1 from anonymized_ISIT_estimation2019
  • Remark 1: Critical $\bm{\beta}$ Values
  • Theorem 3: Theorem 2 from anonymized_ISIT_estimation2019
  • Remark 2: Improved Constant for Bounded Support
  • Remark 3: Comparison to Generic Estimators
  • Theorem 4
  • Theorem 5
  • Theorem 6: MSE Bounds for MC Estimator
  • Remark 4: Comparison to Generic Entropy Estimation
  • ...and 2 more