Rotation Invariant Quantization for Model Compression
Joseph Kampeas, Yury Nahshan, Hanoch Kremer, Gil Lederman, Shira Zaloshinski, Zheng Li, Emir Haleva
TL;DR
The paper studies the rate-distortion tradeoff in post-training neural network quantization. It introduces Rotation-Invariant Quantization (RIQ), which uses a single parameter to control layer-wise mixed-precision quantization and measures distortion with a cosine-based metric. It proves that, under rotation-invariant distortion measures, the rate-distortion optimum is achieved by a product of layer-wise spherical distributions, so a single scalar k can govern the entire model's entropy under a fixed deviation D. The authors provide a practical algorithm (RIQ) that computes per-layer bin widths Δℓ(k) proportional to the layer norms and uses an efficient bounded search for the optimal k. They validate the approach with extensive experiments across vision, NLP, and multi-task models, showing substantial compression with minimal accuracy loss when combined with entropy coding (ANS). The work demonstrates that rotation-invariant, mixed-precision post-training quantization can approach theoretical rate-distortion limits, enabling hardware-friendly, fast, and scalable model compression, with open-source code available.
Abstract
Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ facilitates $\times 19.4$ and $\times 52.9$ compression ratios on pre-trained VGG dense and pruned models, respectively, with $<0.4\%$ accuracy degradation. Code is available in \href{https://github.com/ehaleva/RIQ}{github.com/ehaleva/RIQ}.
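The single-parameter idea can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). It assumes a simplified bin-width rule, $\Delta_\ell(k) = \|w_\ell\| / k$, as a stand-in for the paper's exact rule. It also assumes that the deviation is measured as the cosine distance between the concatenated original and quantized weights, and it uses the empirical entropy of the quantized values as a proxy for the rate an entropy coder such as ANS might reach. All function names (`quantize_layer`, `search_k`, and so on) are hypothetical.

```python
import numpy as np


def quantize_layer(w, k):
    """Uniform quantization with a bin width proportional to the layer norm.

    ASSUMPTION: delta = ||w|| / k is a simplified stand-in for the paper's
    bin-width rule. The layer is assumed to have a nonzero norm.
    """
    delta = np.linalg.norm(w) / k
    return np.round(w / delta) * delta


def cosine_distance(a, b):
    """1 - cosine similarity of two flattened tensors (1.0 if either is all zeros)."""
    a, b = np.ravel(a), np.ravel(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 if denom == 0.0 else 1.0 - float(a @ b) / denom


def entropy_bits(w_q):
    """Empirical entropy of the quantized values, in bits per weight."""
    _, counts = np.unique(w_q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def model_distortion(layers, k):
    """Cosine distance between concatenated original and quantized weights (assumed metric)."""
    original = np.concatenate([np.ravel(w) for w in layers])
    quantized = np.concatenate([np.ravel(quantize_layer(w, k)) for w in layers])
    return cosine_distance(original, quantized)


def search_k(layers, max_distortion, k_lo=1.0, k_hi=1e6, iters=60):
    """Bisection in log space for a small k that meets the distortion target.

    Larger k means finer bins: lower distortion but higher entropy.
    Distortion is only approximately monotone in k, so this is a heuristic.
    """
    if model_distortion(layers, k_hi) > max_distortion:
        raise ValueError("k_hi is too small to meet the distortion target")
    for _ in range(iters):
        k_mid = float(np.sqrt(k_lo * k_hi))
        if model_distortion(layers, k_mid) <= max_distortion:
            k_hi = k_mid  # feasible: try coarser bins
        else:
            k_lo = k_mid  # infeasible: need finer bins
    return k_hi  # k_hi is feasible at every step


# Toy "model": two random layers of different sizes (synthetic data).
rng = np.random.default_rng(0)
layers = [rng.normal(size=(10, 100)), rng.normal(size=(40, 100))]
k = search_k(layers, max_distortion=0.005)
for i, w in enumerate(layers):
    print(f"layer {i}: {entropy_bits(quantize_layer(w, k)):.2f} bits/weight")
```

Under these assumptions, one value of k yields a different bin width, and therefore a different rate, for each layer. That is the mixed-precision behavior the abstract describes.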
