Mutual Information Preserving Neural Network Pruning
Charles Westphal, Stephen Hailes, Mirco Musolesi
TL;DR
MIPP is a principled activation-based pruning method that preserves mutual information (MI) between the activations of adjacent layers, so pruned networks remain retrainable whether pruning occurs before or after training. It uses the Transfer Entropy Redundancy Criterion (TERC) with MI ordering to dynamically remove neurons that do not transfer information downstream, while maintaining a bijective mapping between upstream and downstream activations; this yields an upper bound on the data-mask MI and improves sample efficiency. Theoretical results establish retrainability and MI bounds. Empirical evaluations on MNIST and CIFAR datasets show strong performance and reduced layer collapse across a broad set of architectures, outperforming state-of-the-art baselines in many high-sparsity regimes. The approach demonstrates that MI-preserving pruning can compress models with minimal performance loss and offers practical insights into feature and network compression. Code is forthcoming.
Abstract
Pruning has emerged as the primary approach used to limit the resource requirements of large neural networks (NNs). Since the proposal of the lottery ticket hypothesis, researchers have focused either on pruning at initialization or after training. However, recent theoretical findings have shown that the sample efficiency of robust pruned models is proportional to the mutual information (MI) between the pruning masks and the model's training datasets, whether at initialization or after training. In this paper, starting from these results, we introduce Mutual Information Preserving Pruning (MIPP), a structured activation-based pruning technique applicable before or after training. The core principle of MIPP is to select nodes in a way that conserves MI shared between the activations of adjacent layers, and consequently between the data and masks. Approaching the pruning problem in this manner means we can prove that there exists a function that can map the pruned upstream layer's activations to the downstream layer's, implying re-trainability. We demonstrate that MIPP consistently outperforms state-of-the-art methods, regardless of whether pruning is performed before or after training.
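
To make the core principle concrete, the following is a minimal illustrative sketch of selecting upstream nodes so that the MI with the downstream layer's activations is conserved. It is not the authors' implementation: the joint-Gaussian MI estimator, the backward-elimination loop, the tolerance `tol`, and the names `gaussian_mi` and `prune_upstream` are all assumptions introduced here for illustration, standing in for the TERC-with-MI-ordering procedure described in the paper.

```python
import numpy as np


def gaussian_mi(x, y, ridge=1e-6):
    """Estimate I(X; Y) in nats, ASSUMING X and Y are jointly Gaussian.

    x: (n_samples, d_x) upstream activations
    y: (n_samples, d_y) downstream activations
    This estimator is an illustrative stand-in, not the paper's estimator.
    """
    d_x = x.shape[1]
    if d_x == 0:
        return 0.0  # an empty set of neurons carries no information
    joint = np.hstack([x, y])
    # A small ridge keeps the covariance well conditioned.
    cov = np.cov(joint, rowvar=False) + ridge * np.eye(joint.shape[1])
    _, logdet_x = np.linalg.slogdet(cov[:d_x, :d_x])
    _, logdet_y = np.linalg.slogdet(cov[d_x:, d_x:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)


def prune_upstream(x, y, tol=1e-2):
    """Hypothetical backward elimination over upstream neurons.

    A neuron is dropped only if the MI between the surviving neurons and
    the downstream layer stays within `tol` nats of the full-layer MI.
    """
    keep = list(range(x.shape[1]))
    full_mi = gaussian_mi(x, y)
    for j in range(x.shape[1]):
        trial = [k for k in keep if k != j]
        if full_mi - gaussian_mi(x[:, trial], y) <= tol:
            keep = trial  # neuron j is redundant for predicting y
    return keep


# Synthetic example: neurons 0-2 drive the downstream layer, 3-4 are noise.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 5))
y = np.column_stack([x[:, 0] + x[:, 1], x[:, 1] - x[:, 2]])
y += 0.1 * rng.standard_normal(y.shape)

kept = prune_upstream(x, y)
print("kept upstream neurons:", kept)
```

On this synthetic data the sketch should retain neurons 0, 1, and 2 and discard the two noise neurons, since removing either of the latter leaves the estimated MI essentially unchanged. Comparing each trial against the full-layer MI, rather than against the previous step, keeps the cumulative information loss within `tol`.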
