Efficient compression of neural networks and datasets

Lukas Silvester Barth, Paulo von Petersenn

TL;DR

The paper tackles the problem of reducing neural network and data description length without sacrificing test performance by unifying MDL/algorithmic-information principles with practical, gradient-friendly optimization. It introduces tractable formulations for $\ell_0$ regularization, including probabilistic minimax pruning (PMMP), smooth relaxations, Random Gradient Pruning, and threshold-adaptive masking (TAMADE), and validates them across image classifiers and Transformer models. The results show substantial compression with maintained or improved accuracy, and they reveal that regularization can enhance sample efficiency, consistent with Solomonoff induction predictions. The work provides a broadly applicable framework for model and data compression, offering actionable methods and publicly available code to advance efficient deployment and data processing.

Abstract

We compare, improve, and contribute methods that substantially decrease the number of parameters of neural networks while maintaining high test accuracy. When applying our methods to minimize description length, we obtain very effective data compression algorithms. In particular, we develop a probabilistic reformulation of $\ell_0$ regularized optimization for nonlinear models that does not require Monte-Carlo sampling and thus improves upon previous methods. We also improve upon methods involving smooth approximations to the $\ell_0$ norm, and investigate layerwise methods. We compare the methods on different architectures and datasets, including convolutional networks trained on image datasets and transformers trained on parts of Wikipedia. We also created a synthetic teacher-student setup to investigate compression in a controlled continuous setting. Finally, we conceptually relate compression algorithms to Solomonoff's theory of inductive inference and empirically verify the prediction that regularized models can exhibit more sample-efficient convergence.
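To make the idea of a smooth approximation to the $\ell_0$ norm concrete, here is a minimal sketch in Python. The surrogate $\sum_i (1 - e^{-\beta |w_i|})$ is one common choice and is an assumption made for illustration; the abstract does not state which relaxation the authors use. The function names and the parameter `beta` are hypothetical.

```python
import numpy as np

def l0_norm(w):
    """Exact number of nonzero entries; piecewise constant, so not usable with gradients."""
    return int(np.sum(w != 0))

def smooth_l0(w, beta=5.0):
    """Assumed smooth surrogate: sum_i (1 - exp(-beta * |w_i|)).

    Each term is 0 at w_i = 0 and approaches 1 as |w_i| grows,
    so the sum approaches the exact count as beta increases.
    """
    return float(np.sum(1.0 - np.exp(-beta * np.abs(w))))

def smooth_l0_grad(w, beta=5.0):
    """Gradient of the surrogate with respect to w (taken as 0 at w_i = 0)."""
    return beta * np.sign(w) * np.exp(-beta * np.abs(w))

w = np.array([0.0, 1e-3, -0.5, 2.0])
exact = l0_norm(w)                        # counts the three nonzero entries
approx_loose = smooth_l0(w, beta=5.0)     # small weights contribute almost nothing
approx_tight = smooth_l0(w, beta=5000.0)  # close to the exact count
grad = smooth_l0_grad(w, beta=5.0)
```

A larger `beta` tracks the true count more closely but yields gradients that vanish for all but the smallest weights, which is the usual trade-off with this kind of relaxation.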

Paper Structure

This paper contains 44 sections, 8 theorems, 52 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2.1

Given any $\mu\in\mathcal{M}$, let $\xi(x)$ be a (semi-) probability distribution that satisfies $\xi(x) \ge w(\mu)~\mu(x)$ for some fixed $w(\mu)$ and all $x$. Then, for all $n\in\mathbb{N}$,

Figures (9)

  • Figure 1: 'Test Loss' vs 'Model Byte Size' for different dataset sizes in the teacher-student setup. Description length (in bytes) isolines are color coded. Curves and error bars were computed as described in Appendix \ref{app:TeacherStudentResultSummary}. Teacher and student architectures have layer sizes [2, 5, 8, 1] and [2, 25, 25, 1], respectively. The data was sampled from the teacher network with standard deviation $\sigma=0.08$.
  • Figure 2: Mean loss vs model size with description length (in MB) isolines for the transformer described in Section \ref{sec:transformerExperiments}. From left to right, the three panels correspond to dataset sizes of 16MB, 50MB and 300MB, respectively. The solid lines correspond to the three regularization methods PMMP, DRR and R-L1 described in Section \ref{sec:methods}. The crosses correspond to transformers of the indicated model size trained without regularization.
  • Figure 3: Weights in a simple neural network before (left) and after (right) random gradient pruning. (Line thickness and color indicate connection strength and sign of weights. Circle colors indicate strength of biases.)
  • Figure 4: Effect of $\ell_2$-regularization strength ($\rho$) on performance across datasets, visualized via $\alpha$-Hull plots. Each dot corresponds to a different hyperparameter configuration. Solid lines denote the optimal trade-off frontiers. Compression is measured as the ratio of uncompressed to regularized model size. Color encodes $\rho$: yellow (highest), pink (intermediate), and green ($\rho = 0$). (a) MNIST (LeNet-300-100), (b) CIFAR-10 (MLP with three 512-d hidden layers), and (c) Teacher-Student setting with teacher [2, 5, 5, 1] and student [2, 25, 25, 1]. In (c), inverse test MSE is used (higher is better); the dashed vertical line indicates teacher model size.
  • Figure 5: 'Accuracy' vs 'Model Size Compression Rate' for different datasets and models.
  • ...and 4 more figures
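
The teacher-student setup in the Figure 1 caption can be sketched as follows. Only the layer sizes [2, 5, 8, 1] and [2, 25, 25, 1] and the value $\sigma=0.08$ come from the caption. Everything else is an assumption made for illustration: dense layers with biases, tanh activations, the weight initialization, inputs drawn uniformly from $[-1, 1]^2$, and reading $\sigma$ as the standard deviation of Gaussian noise added to the teacher output. All function names are hypothetical.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Random dense layers (weight matrix, bias vector); initialization scale is assumed."""
    return [(rng.normal(size=(m, n)) / np.sqrt(m), 0.1 * rng.normal(size=n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """MLP forward pass with tanh on hidden layers (activation is assumed)."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

def num_params(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))

def sample_dataset(teacher, n, sigma, rng):
    """Inputs uniform in [-1, 1]^2 (assumed); targets are teacher outputs plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = forward(teacher, x) + rng.normal(scale=sigma, size=(n, 1))
    return x, y

rng = np.random.default_rng(0)
teacher_sizes, student_sizes = [2, 5, 8, 1], [2, 25, 25, 1]
teacher = init_mlp(teacher_sizes, rng)
x, y = sample_dataset(teacher, n=1000, sigma=0.08, rng=rng)
```

Under these assumptions the teacher has 72 parameters and the student 751, so a student that fits the data well has substantial room to be pruned toward the teacher's size.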

Theorems & Definitions (11)

  • Theorem 2.1
  • Proposition 2.2
  • proof
  • Lemma 2.3
  • proof
  • Corollary 2.4
  • Lemma A.1
  • proof
  • Proposition B.1
  • Lemma B.2
  • ...and 1 more