Pruning vs Quantization: Which is Better?

Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort

TL;DR

The paper asks whether pruning or quantization yields better accuracy at a comparable compression ratio, with the aim of informing hardware design. It combines an analytical comparison of expected error, per-layer lower bounds on that error, and full-model experiments, covering both synthetic distributions and real weight tensors. Quantization generally outperforms pruning at moderate compression ratios; pruning is competitive only at very high compression ratios or for heavy-tailed weight distributions with severe outliers. In practice, this suggests trying quantization before pruning when efficiency is the constraint.

Abstract

Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with a very high compression ratio might pruning be beneficial from an accuracy standpoint.
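
The kind of comparison the abstract describes can be illustrated on synthetic data. The sketch below is not the authors' code: it draws standard normal samples, applies uniform symmetric INT4 quantization and 75% magnitude pruning, and reports the signal-to-noise ratio (SNR) of each. The helper names, the dB form of the SNR, the grid search over the clipping range, and the premise that both settings correspond to the same compression ratio relative to a 16-bit baseline are assumptions made for this illustration.

```python
import math
import random

def snr_db(original, compressed):
    """SNR in dB: 10 * log10(sum(w^2) / sum((w - w_hat)^2))."""
    signal = sum(w * w for w in original)
    noise = sum((w - c) ** 2 for w, c in zip(original, compressed))
    return 10.0 * math.log10(signal / noise)

def quantize_symmetric(weights, bits, clip):
    """Uniform symmetric quantization onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1            # 7 for INT4
    scale = clip / qmax                   # step size of the grid
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))  # round, then clamp
        out.append(q * scale)
    return out

def prune_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Choose the clipping range by a coarse grid search (assumption of this sketch).
clips = [1.5 + 0.25 * i for i in range(11)]
snr_quant, best_clip = max(
    (snr_db(weights, quantize_symmetric(weights, 4, c)), c) for c in clips
)
snr_prune = snr_db(weights, prune_magnitude(weights, 0.75))

print(f"INT4 (clip={best_clip:.2f}): {snr_quant:.1f} dB")
print(f"75% pruning: {snr_prune:.1f} dB")
```

On Gaussian samples the quantized tensor should come out with a clearly higher SNR than the pruned one, which is consistent with the abstract's conclusion for moderate compression ratios.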

Paper Structure (34 sections, 18 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Comparison for a standard normal distribution. (left) Distributions after pruning and quantization for INT4 and 75% pruning. (middle) The squared error weighted by probability. (right) SNR for different compression ratios.
  • Figure 2: Comparing the error of pruning and quantization for a Student's t-distribution, simulating the presence of significant outliers. Results are plotted for different outlier magnitudes, measured by the kurtosis on the x-axis. (left) The pruning error, which does not change as outliers become more severe. (middle) The quantization SNR, which drops sharply as outliers increase. (right) The trade-off regions where quantization and pruning are better.
  • Figure 3: Comparison on all the weights from the PyTorch model zoo (46 models) combined with 3 large language models (Bloom-3b, Llama-3b, OPT-2.7b). (left) Pruning SNR versus quantization SNR for every tensor. (right) Pruning is preferable at high compression ratios for tensors with high sample kurtosis values.
  • Figure 4: Comparison in the post-training scenario. Each box corresponds to a subset of one of 10 layers from the 4 different models that were used, with 7 different bit-width comparison points. The ranges of the box indicate the lower and upper bounds found by the algorithms.
  • Figure 5: Combining pruning and quantization on ImageNet models. The average bit-width shown on the x-axis is computed as the product of the base bit-width and the density of non-zero weight elements. Different pruning ratios are applied to each base bit-width model. Quantized models with only natural sparsity and no extra pruning are marked with crosses.
  • ...and 4 more figures
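
Two quantities named in the captions above can be sketched in a few lines: the average bit-width from Figure 5 (base bit-width times the density of non-zero weights) and the sample kurtosis used in Figures 2 and 3 to measure how heavy the tails are. This is an illustrative sketch, not the authors' code. The function names are hypothetical, the kurtosis is assumed to be the non-excess form (about 3 for a Gaussian), and the choice of 5 degrees of freedom for the Student's t samples is arbitrary.

```python
import random

def average_bitwidth(base_bits, weights):
    """Base bit-width multiplied by the fraction of non-zero weights."""
    density = sum(1 for w in weights if w != 0.0) / len(weights)
    return base_bits * density

def sample_kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)

def student_t(rng, df):
    """One Student's t sample built as normal / sqrt(chi-square / df)."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees of freedom
    return rng.gauss(0.0, 1.0) / (chi2 / df) ** 0.5

rng = random.Random(0)
gauss_kurt = sample_kurtosis([rng.gauss(0.0, 1.0) for _ in range(20_000)])
heavy_kurt = sample_kurtosis([student_t(rng, 5) for _ in range(20_000)])

# A 4-bit tensor in which half of the weights are zero averages 2.0 bits.
avg_bits = average_bitwidth(4, [0.0, 1.5, 0.0, -2.0])

print(f"average bit-width: {avg_bits}")
print(f"kurtosis, Gaussian: {gauss_kurt:.2f}, Student's t: {heavy_kurt:.2f}")
```

The heavy-tailed samples should show a larger kurtosis than the Gaussian ones, which is the regime where the captions report pruning becoming preferable at high compression ratios.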