Information Plane Analysis Visualization in Deep Learning via Transfer Entropy

Adrian Moldovan; Angel Cataron; Razvan Andonie

TL;DR

This paper addresses how information flows and is compressed in deep networks during training, and whether compression relates to generalization. It uses Transfer Entropy (TE) to quantify directional, layer-to-layer information transfer and integrates it with Information Plane (IP) analysis by binarizing activations to compute TE between adjacent layers. The main contribution is the first application of TE to investigate the Information Bottleneck (IB) principle in neural networks, demonstrated on shallow and CNN architectures. In these experiments TE concentrates in the final layers, decreases during training, and correlates with accuracy and loss, which suggests TE as a layer-wise proxy for compression and a diagnostic for learning dynamics. The approach offers temporally aware insights into learning dynamics and points to potential TE-guided training strategies or regularization to improve efficiency and generalization in deep networks.
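The binarize-then-measure step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a history length of 1, a plug-in (frequency count) estimate of the probabilities, and a caller-supplied threshold. The function names `binarize` and `transfer_entropy` are hypothetical.

```python
import math
from collections import Counter


def binarize(values, threshold):
    """Map each activation to 1 if it exceeds the threshold, else 0.

    The threshold is an assumption of this sketch; the summary above
    does not say how it is chosen.
    """
    return [1 if v > threshold else 0 for v in values]


def transfer_entropy(source, target):
    """Plug-in estimate of TE from source to target, in bits.

    Assumes history length 1:
      TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) )
    where y1 = target[t+1], y0 = target[t], x0 = source[t].
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")

    # Count every observed (y_next, y_prev, x_prev) triple.
    triples = Counter()
    for t in range(len(target) - 1):
        triples[(target[t + 1], target[t], source[t])] += 1

    n = sum(triples.values())
    if n == 0:
        return 0.0

    # Marginal counts needed for the two conditional probabilities.
    count_y0_x0 = Counter()
    count_y1_y0 = Counter()
    count_y0 = Counter()
    for (y1, y0, x0), c in triples.items():
        count_y0_x0[(y0, x0)] += c
        count_y1_y0[(y1, y0)] += c
        count_y0[y0] += c

    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n
        p_y1_given_y0_x0 = c / count_y0_x0[(y0, x0)]
        p_y1_given_y0 = count_y1_y0[(y1, y0)] / count_y0[y0]
        te += p_joint * math.log2(p_y1_given_y0_x0 / p_y1_given_y0)
    return te
```

In a layer-to-layer setting, `source` and `target` would be the binarized activation series of one neuron in each of two adjacent layers, recorded over successive training samples. A series carries no transfer entropy to itself under this estimate, while a target that copies the source with a one-step delay generally yields a positive value.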

Abstract

In a feedforward network, Transfer Entropy (TE) can be used to measure the influence that one layer has on another by quantifying the information transfer between them during training. According to the Information Bottleneck principle, a neural model's internal representation should compress the input data as much as possible while still retaining sufficient information about the output. Information Plane analysis is a visualization technique used to understand the trade-off between compression and information preservation in the context of the Information Bottleneck method by plotting the amount of information in the input data against the compressed representation. The claim that there is a causal link between information-theoretic compression and generalization, measured by mutual information, is plausible, but results from different studies are conflicting. In contrast to mutual information, TE can capture temporal relationships between variables. To explore such links, in our novel approach we use TE to quantify information transfer between neural layers and perform Information Plane analysis. We obtained encouraging experimental results, opening the possibility for further investigations.
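For reference, the trade-off described above is usually written in the Information Bottleneck literature as the objective below. This is the standard textbook form, stated here as an assumption about notation; the paper's own two equations are not reproduced in this excerpt and may differ. Here $X$ is the input, $Y$ the output, $T$ the internal representation, $I(\cdot;\cdot)$ mutual information, and $\beta \ge 0$ a trade-off parameter.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

In the mutual-information version of Information Plane analysis, each layer's representation $T$ is then plotted as the point $\bigl(I(X;T),\, I(T;Y)\bigr)$ and tracked over training epochs.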

Paper Structure (10 sections, 2 equations, 6 figures, 1 table)

Introduction
Related Work
Transfer Entropy in Neural Networks
Information Bottleneck in Neural Networks
Information Bottleneck using Transfer Entropy
Experiments
Shallow architectures
CNN architecture
Discussion
Conclusion and Future Work

Figures (6)

Figure 1: Schematic of the three-layer feedforward neural network. Each pair of layers contributes a vector of TE values, illustrated with red circles, while green lines show the associated neuron pairs that produce each TE value.
Figure 2: TE for a single-hidden-layer feedforward network. Each plotted TE value is the average TE over all training samples across all epochs. Epochs are 'stacked' along the x axis. The glass dataset results indicate a hard problem for single-hidden-layer networks, as can be seen in Table 1; the 'TE Output' evolution is slow and has small variance.
Figure 3: TE calculated between the input and hidden layers, for each training sample, for every 4th epoch on the Ionosphere dataset. The number of epochs has been reduced to improve readability.
Figure 4: Sample-averaged TE values for multiple networks, for the last pair of layers (linear and softmax).
Figure 5: Overtraining a fully connected network with 3 hidden layers on the Iris dataset. TE is shown for input to hidden layer (labeled 'TE input'), hidden to hidden ('TE hidden'), and hidden to output ('TE output'). On the $x$ axis we plot the averaged normalized TE values for each training batch across all epochs.
...and 1 more figure
