Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding

Alexander Conzelmann, Robert Bamler

TL;DR

A novel post-training compression framework is proposed that combines rate-aware quantization with entropy coding. It extends the well-known layer-wise loss with a quadratic rate estimate and provides locally exact solutions to this modified objective, following the Optimal Brain Surgeon method.

Abstract

The ever-growing size of neural networks poses serious challenges on resource-constrained devices, such as embedded sensors. Compression algorithms that reduce their size can mitigate these problems, provided that model performance stays close to the original. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding by (1) extending the well-known layer-wise loss by a quadratic rate estimation, and (2) providing locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. Our method allows for very fast decoding and is compatible with arbitrary quantization grids. We verify our results empirically by testing on various computer-vision networks, achieving a 20-40% decrease in bit rate at the same performance as the popular compression algorithm NNCodec. Our code is available at https://github.com/Conzel/cerwu.
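
To make the phrase "rate-aware quantization" more concrete, the following is a minimal sketch of the general idea, not the method proposed in the paper. It rounds each weight to one of its two neighbouring grid points by minimizing squared error plus a penalty proportional to the estimated code length of the chosen symbol. It assumes a uniform grid and a fixed rate model fitted to the nearest-rounded weights, and it ignores the layer-wise Hessian and the OBS-style weight updates entirely. All function names and parameters (`fit_rate_model`, `rate_aware_round`, `lam`, `step`) are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol; a proxy for the rate of an ideal entropy coder."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def fit_rate_model(symbols):
    """Code length (bits) per grid index, from an add-one smoothed histogram.

    The table covers one extra index on each side so that both rounding
    candidates of every weight are guaranteed to have an entry.
    """
    lo = int(symbols.min()) - 1
    hi = int(symbols.max()) + 1
    counts = np.bincount(symbols - lo, minlength=hi - lo + 1) + 1.0
    return lo, -np.log2(counts / counts.sum())

def rate_aware_round(w, step, lam, lo, bits):
    """Pick floor or ceil grid index minimizing squared error + lam * code length."""
    floor = np.floor(w / step).astype(np.int64)
    ceil = floor + 1
    cost_floor = (w - floor * step) ** 2 + lam * bits[floor - lo]
    cost_ceil = (w - ceil * step) ** 2 + lam * bits[ceil - lo]
    return np.where(cost_ceil < cost_floor, ceil, floor)

# Synthetic "weights"; real networks would supply these per layer.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=10_000)
step = 0.02

nearest = np.round(w / step).astype(np.int64)
lo, bits = fit_rate_model(nearest)

for lam in (0.0, 1e-4, 1e-3):
    q = rate_aware_round(w, step, lam, lo, bits)
    mse = float(np.mean((w - q * step) ** 2))
    print(f"lam={lam:g}  entropy={entropy_bits(q):.3f} bits/weight  mse={mse:.3e}")
```

Increasing `lam` trades reconstruction error for a shorter code under the fixed rate model, which is the trade-off that rate-distortion curves such as those in the figures below visualize. The resulting integer symbols would then be passed to an entropy coder, which this sketch does not implement.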

Paper Structure

This paper contains 37 sections, 32 equations, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: Performance of our compression methods on various networks trained and evaluated on ImageNet. For better visibility, a Pareto front over the parameters was calculated for each curve. The inset at the bottom right of each plot shows a zoomed-in version of the area marked in the red bounding box. The inset ranges over $(0.95 \cdot \text{acc}_{\text{orig.}}, 1.0125 \cdot \text{acc}_{\text{orig.}})$ on the y-axis and encompasses a 1.5-bits-per-weight range in the x-axis. Our proposed methods and ablations are marked with solid lines, baselines are marked in dashed lines, and the original performance of the (uncompressed) network is marked with a horizontal, gray dashed line. The plot titles include the number of quantizable parameters each network has (multiply by 32 bit to get the uncompressed storage size on disk).
  • Figure 2: Performance of compression methods for CIFAR10-trained networks, analogous to Figure 1.
  • Figure 3: Minimum bits per weight achieved at 99% (left) and 95% (right) of the original test accuracy for different methods. Lower is better. Both CERWU (blue) and CERWU-$\gamma\!=\!0$ (orange) outperform all other methods, with CERWU achieving a slight edge over CERWU-$\gamma\!=\!0$ for most networks.
  • Figure 4: Run times for compressing ResNets of differing sizes. Left: run times for the first run; Right: run times for subsequent runs for different values of $\lambda$ or $k$ (which can reuse the Hessian).
  • Figure 5: Rate-distortion performance of our methods on the small language model Pythia-70M. We swept over grid sizes of {4, 16, 128, 256, 352, 512, 768, 1024, 2048}.
  • ...and 1 more figure