Mutual Information Preserving Neural Network Pruning

Charles Westphal, Stephen Hailes, Mirco Musolesi

TL;DR

MIPP introduces a principled activation-based pruning method that preserves mutual information (MI) between adjacent layer activations, enabling retrainable pruned networks whether pruning occurs before or after training. It leverages the Transfer Entropy Redundancy Criterion (TERC) with MI ordering to dynamically remove non-transferent neurons while maintaining a bijective mapping between upstream and downstream activations, yielding an upper bound on the data-mask MI and improving sample efficiency. Theoretical results establish retrainability and MI bounds, while empirical evaluations on MNIST and CIFAR datasets show strong performance and reduced layer-collapse across a broad set of architectures, outperforming state-of-the-art baselines in many high-sparsity regimes. The approach demonstrates that MI-preserving pruning can effectively compress models with minimal performance loss and offers practical insights into feature and network compression, with code forthcoming.

Abstract

Pruning has emerged as the primary approach used to limit the resource requirements of large neural networks (NNs). Since the proposal of the lottery ticket hypothesis, researchers have focused either on pruning at initialization or after training. However, recent theoretical findings have shown that the sample efficiency of robust pruned models is proportional to the mutual information (MI) between the pruning masks and the model's training datasets, whether at initialization or after training. In this paper, starting from these results, we introduce Mutual Information Preserving Pruning (MIPP), a structured activation-based pruning technique applicable before or after training. The core principle of MIPP is to select nodes in a way that conserves MI shared between the activations of adjacent layers, and consequently between the data and masks. Approaching the pruning problem in this manner means we can prove that there exists a function that can map the pruned upstream layer's activations to the downstream layer's, implying re-trainability. We demonstrate that MIPP consistently outperforms state-of-the-art methods, regardless of whether pruning is performed before or after training.
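
The core principle stated in the abstract, keeping only those upstream nodes needed to conserve the MI shared with the downstream layer, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes discrete (for example, binarized) activations so that a plug-in MI estimate is exact on the sample, and the function names `mutual_information` and `prune_preserving_mi`, the tolerance `tol`, and the least-informative-first visiting order are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired, hashable samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def prune_preserving_mi(upstream, downstream, tol=1e-9):
    """Backward elimination over upstream units (illustrative assumption).

    upstream / downstream: lists of per-sample tuples of discrete activations.
    A unit is dropped only if the remaining units still carry (within tol)
    the same estimated MI with the full downstream activation vector.
    Returns the indices of the units that are kept.
    """
    targets = [tuple(row) for row in downstream]

    def mi_of(units):
        projected = [tuple(row[j] for j in units) for row in upstream]
        return mutual_information(projected, targets)

    kept = list(range(len(upstream[0])))
    full_mi = mi_of(kept)
    # Assumed ordering: try the individually least informative units first.
    for j in sorted(kept, key=lambda unit: mi_of([unit])):
        candidate = [k for k in kept if k != j]
        if candidate and mi_of(candidate) >= full_mi - tol:
            kept = candidate
    return kept


# Toy layer: unit 2 duplicates unit 0, unit 3 is constant,
# and the single downstream unit is XOR(unit 0, unit 1).
upstream = [(0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 0)]
downstream = [(0,), (1,), (1,), (0,)]
kept_units = prune_preserving_mi(upstream, downstream)
```

In this toy case the redundant copy and the constant unit are removed, while one unit from each informative source is retained. Real activations are continuous and high-dimensional, so a practical method needs a different MI estimator than the exact counting used here.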

Paper Structure

This paper contains 30 sections, 6 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: We introduce MIPP via an illustration. MIPP acts to preserve the mutual information (MI) between the activations in adjacent layers. In turn, this leads to a pruned network representation whose nodes and mask replicate the information contained in the data.
  • Figure 2: a) Graphical representation of how MI between the mask and the data affects the accuracy of a small convolution-based and standard NN: we observe that by maximizing MI, the classification accuracy increases. The experiments are based on synthetic data; for full details refer to Appendix \ref{app:linei}. b) A study examining how pruning masks, created using various PaI methods and applied to a small synthetic network, affect the values of $I(\mathcal{D};\mathcal{M})$. For full details about these experiments, please refer to Appendix \ref{app:whatpai}. c) Comparison of MIPP's average accuracy across different sparsity ratios to the best-performing baseline for each model-dataset combination. MIPP significantly outperforms the best competing baseline because, at high sparsities, the baselines are all much more prone to layer collapse. PaT baselines: OTO, IMP, SOSP-H, ThiNet. PaI baselines: IterSNIP, IterGraSP, ProsPr, SynFlow.
  • Figure 3: Top. Deforming MNIST for increased image complexity. These transformations were applied randomly with equal probability and then kept consistent during training, pruning, and re-training. Bottom. Changes in pruning ability of MIPP caused by image deformation.
  • Figure 4: Comparison of MIPP's ability to prune versus baselines both at initialization and after training. For clarity, we set an accuracy range to avoid viewing data points in which layer collapse has occurred.
  • Figure 5: The percentage of runs that led to untrainable layer collapse. Specifically, we bin runs by the percentage of neurons removed, where one bin contains all the runs within a 5% increment. We then calculate the percentage of these runs that lead to layer collapse.
  • ...and 5 more figures
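
The binning described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `collapse_rate_by_bin`, the `(percent_removed, collapsed)` input format, and the choice to place a run at exactly 100% in the last bin are assumptions, and the example runs are made up purely to show the computation.

```python
from collections import defaultdict


def collapse_rate_by_bin(runs, bin_width=5.0):
    """Percentage of runs with layer collapse, per bin of neurons removed.

    runs: iterable of (percent_removed, collapsed) pairs, where
    percent_removed is in [0, 100] and collapsed is a bool.
    Returns {bin_start: percent of runs in that bin that collapsed}.
    """
    last_index = int(100 // bin_width) - 1
    totals = defaultdict(int)
    collapsed_counts = defaultdict(int)
    for percent_removed, collapsed in runs:
        # A run at exactly 100% is assigned to the final bin (assumption).
        index = min(int(percent_removed // bin_width), last_index)
        start = index * bin_width
        totals[start] += 1
        collapsed_counts[start] += int(collapsed)
    return {
        start: 100.0 * collapsed_counts[start] / totals[start]
        for start in sorted(totals)
    }


# Made-up runs, only to demonstrate the binning.
example_runs = [(42.0, False), (91.0, False), (93.5, True), (96.0, True), (97.2, True)]
rates = collapse_rate_by_bin(example_runs)
```

Each key is the lower edge of a 5% bin, so the entry at 90.0 covers runs that removed between 90% and 95% of neurons.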