Efficient compression of neural networks and datasets
Lukas Silvester Barth, Paulo von Petersenn
TL;DR
The paper tackles the problem of reducing the description length of neural networks and data without sacrificing test performance, unifying MDL and algorithmic-information principles with practical, gradient-friendly optimization. It introduces tractable formulations of $\ell_0$ regularization, including probabilistic minimax pruning (PMMP), smooth relaxations, Random Gradient Pruning, and threshold-adaptive masking (TAMADE), and validates them on image classifiers and Transformer models. The results show substantial compression with maintained or improved accuracy, and indicate that regularization can improve sample efficiency, consistent with predictions from Solomonoff induction. The resulting framework applies to both model and data compression, and the code is publicly available.
Abstract
We compare, improve, and contribute methods that substantially decrease the number of parameters of neural networks while maintaining high test accuracy. When applying our methods to minimize description length, we obtain very effective data compression algorithms. In particular, we develop a probabilistic reformulation of $\ell_0$-regularized optimization for nonlinear models that does not require Monte Carlo sampling and thus improves upon previous methods. We also improve upon methods involving smooth approximations to the $\ell_0$ norm, and investigate layerwise methods. We compare the methods on different architectures and datasets, including convolutional networks trained on image datasets and transformers trained on parts of Wikipedia. We also create a synthetic teacher-student setup to investigate compression in a controlled continuous setting. Finally, we conceptually relate compression algorithms to Solomonoff's theory of inductive inference and empirically verify the prediction that regularized models can exhibit more sample-efficient convergence.
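To make the notion of a smooth approximation to the $\ell_0$ norm concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes a generic Gaussian-shaped surrogate $\sum_i \bigl(1 - \exp(-w_i^2 / (2\beta^2))\bigr)$, which tends to the exact count of nonzero entries as $\beta \to 0$. The function names and the parameters `alpha` (regularization strength) and `beta` (smoothing width) are hypothetical and chosen only for illustration.

```python
import numpy as np

def l0_norm(w, tol=0.0):
    """Exact l0 'norm': number of entries with magnitude above tol."""
    w = np.asarray(w, dtype=float)
    return int(np.sum(np.abs(w) > tol))

def smooth_l0(w, beta=0.1):
    """Differentiable surrogate (an assumption, not the paper's choice):
    sum_i (1 - exp(-w_i^2 / (2 * beta^2)))."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(1.0 - np.exp(-(w ** 2) / (2.0 * beta ** 2))))

def regularized_loss(data_loss, w, alpha=1e-2, beta=0.1):
    """Data-fit term plus alpha times the smooth parameter count."""
    return data_loss + alpha * smooth_l0(w, beta)

# Toy weight vector with two nonzero entries.
w = np.array([0.0, 0.0, 1.5, -2.0, 0.0])
exact_count = l0_norm(w)
smooth_count = smooth_l0(w, beta=0.1)
total = regularized_loss(1.0, w, alpha=0.5, beta=0.1)
```

For weights that are either exactly zero or large relative to `beta`, the surrogate is close to the exact count, while remaining differentiable, so it can be added to a training loss and minimized by gradient descent.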
