Rotation Invariant Quantization for Model Compression

Joseph Kampeas; Yury Nahshan; Hanoch Kremer; Gil Lederman; Shira Zaloshinski; Zheng Li; Emir Haleva

TL;DR

The paper studies the rate-distortion tradeoff in post-training neural network quantization and introduces Rotation-Invariant Quantization (RIQ), which uses a single parameter to control layer-wise mixed-precision quantization and measures distortion with a cosine-distance metric. It proves that, under rotation-invariant distortion measures, the rate-distortion optimum is achieved by a product of layer-wise spherical distributions, so a single scalar k can govern the entire model's entropy under a fixed deviation constraint D. The resulting algorithm computes per-layer bin widths Δℓ(k) proportional to the layer norms and finds the optimal k with an efficient bounded search. Experiments on vision, NLP, and multi-task models show substantial compression with minimal accuracy loss when RIQ is combined with entropy coding (ANS). The work indicates that rotation-invariant, mixed-precision post-training quantization can approach theoretical rate-distortion limits while remaining hardware-friendly, fast, and scalable. The code is open source.
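The single-parameter idea summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the bin-width rule `bin_width` (layer norm divided by k), the use of cosine distance on the concatenated weights as a stand-in for the deviation measured on a calibration set, the bisection bounds, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def quantize(weights, delta):
    """Uniform scalar quantization: snap each weight to the nearest multiple of delta."""
    return [delta * round(w / delta) for w in weights]

def cosine_distance(a, b):
    """1 - cos(angle between a and b); returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def bin_width(weights, k):
    # ASSUMPTION: the simplest norm-proportional rule, delta = ||w|| / k.
    # The summary only says the bin width is proportional to the layer norm.
    return math.sqrt(sum(w * w for w in weights)) / k

def model_distortion(layers, k):
    # ASSUMPTION: distortion is measured on the concatenated weights,
    # as a proxy for a deviation measured on model outputs.
    flat, flat_q = [], []
    for w in layers:
        flat.extend(w)
        flat_q.extend(quantize(w, bin_width(w, k)))
    return cosine_distance(flat, flat_q)

def search_k(layers, max_deviation, k_lo=1.0, k_hi=1e4, iters=40):
    """Bisection (in log scale) for a small k whose distortion stays within max_deviation.

    Treats distortion as roughly decreasing in k; the returned k always
    satisfies the constraint because k_hi is only moved to feasible points.
    """
    if model_distortion(layers, k_hi) > max_deviation:
        raise ValueError("k_hi is too small to meet the deviation constraint")
    for _ in range(iters):
        k_mid = math.sqrt(k_lo * k_hi)
        if model_distortion(layers, k_mid) <= max_deviation:
            k_hi = k_mid
        else:
            k_lo = k_mid
    return k_hi

def entropy_bits(weights, delta):
    """Empirical entropy (bits per weight) of the quantization indices."""
    counts = Counter(round(w / delta) for w in weights)
    n = len(weights)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Two synthetic "layers" with different scales and sizes.
random.seed(0)
layers = [[random.gauss(0.0, s) for _ in range(n)] for s, n in [(1.0, 256), (0.1, 1024)]]
k = search_k(layers, max_deviation=0.005)
rates = [entropy_bits(w, bin_width(w, k)) for w in layers]
print("k =", k, "per-layer rates (bits):", rates)
```

Because each layer's bin width depends on its own norm, one value of k yields a different empirical entropy per layer, which is the sense in which a single parameter produces mixed precision. An entropy coder such as ANS would then be needed to bring the stored size close to these rates.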

Abstract

Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ facilitates $\times 19.4$ and $\times 52.9$ compression ratios on pre-trained VGG dense and pruned models, respectively, with $<0.4\%$ accuracy degradation. Code is available at https://github.com/ehaleva/RIQ.

Paper Structure (22 sections, 6 theorems, 42 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Problem Statement
Rate-Distortion Theory
Uniform Scalar Quantization
Rotation-invariant Mixed-Precision Quantization
The RIQ Algorithm
RIQ Rate-Distortion Analysis
Empirical results
Conclusion
Appendix
Proof of
Proof of
Proof of
...and 7 more sections

Key Result

Lemma 4.1

Let $\epsilon_\ell \triangleq 1 - \cos(\theta_\ell)$ be the distortion of layer $\ell$. Then, in the high-rate region, the quantization bin width asymptotically satisfies
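One way to see how a bin width can be tied to the cosine distortion is the standard high-rate approximation, in which the rounding error of a uniform quantizer is treated as uniform on $[-\Delta_\ell/2, \Delta_\ell/2]$ with variance $\Delta_\ell^2/12$. The sketch below is a back-of-the-envelope derivation under that assumption and is not a transcription of the lemma's statement or proof. The symbols $n_\ell$ (number of weights in layer $\ell$) and $\hat{\mathbf{w}}_\ell$ (quantized weights) are notation introduced here for illustration.

```latex
\begin{align*}
\|\mathbf{w}_\ell - \hat{\mathbf{w}}_\ell\|^2
  &\approx n_\ell \,\frac{\Delta_\ell^2}{12}
  && \text{(high-rate uniform error model)}, \\
\|\mathbf{w}_\ell - \hat{\mathbf{w}}_\ell\|^2
  &\approx 2\,\|\mathbf{w}_\ell\|^2 \bigl(1 - \cos\theta_\ell\bigr)
   = 2\,\|\mathbf{w}_\ell\|^2\,\epsilon_\ell
  && \text{(assuming } \|\hat{\mathbf{w}}_\ell\| \approx \|\mathbf{w}_\ell\| \text{)}, \\
\Longrightarrow\quad
\Delta_\ell
  &\approx \|\mathbf{w}_\ell\| \sqrt{\frac{24\,\epsilon_\ell}{n_\ell}}.
\end{align*}
```

Under these assumptions the bin width scales linearly with the layer norm, which is consistent with the TL;DR's description of Δℓ(k) as proportional to layer norms.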

Figures (5)

Figure 1: (a) Cosine distance of the validation dataset as a function of the deviation constraint (on calibration set). Models include VGG (green), ResNet-50 (red), and ViT (blue). (b) Accuracy vs Compression for ResNet-50 model. RIQ (green) vs. RAP (orange), OCS (red), linear quantization (cyan), and HAQ (purple). HAQ, however, requires training.
Figure 2: (a) Validation of Proposition \ref{prop: monotonically decreasing} on ResNet-50. (b) The impact of $\epsilon_0$ on the performance of RIQ on ResNet-50. Higher values of $\epsilon_0$ attain higher compression, but also higher deviation.
Figure 3: (a) Illustration of the surrogate model in \ref{model}. In this illustration, the quantized weights are modeled by the result of an orthogonal transformation ${\mathbf{U}}(\theta_\ell | {\mathbf{w}}_\ell)$ which rotates the vector ${\mathbf{w}}_\ell$ randomly onto a ring that is $\theta_\ell$ away from ${\mathbf{w}}_\ell$. Note that the true quantization result also lies on this ring. (b) Illustration of the projection of ${\tilde{\mathbf{w}}}_\ell$ onto the arbitrary perpendicular vectors ${\mathbf{v}}_1$ and ${\mathbf{v}}_2$.
Figure 4: (a) Rate-distortion curve for the ResNet-50 model obtained with RIQ (green circles) and with uniform linear quantization (red squares). Rates are shown both for the quantized model (dashed) and after ANS compression. (b) ResNet-50 rate-per-layer statistics with $\epsilon_0 = 0.01$ in all layers. (c) Rate-distortion curves obtained by RIQ + ANS for a variety of models: VGG (green circles), ResNet-50 (red squares), ViT (blue diamonds), and DistilBERT (orange triangles).
Figure 5: (a) The compression ratio as a function of cosine distance. The bottom-left red triangle depicts the distance of 0.0069 achieved by the baseline with a compression ratio of $\times 4$. The orange upside-down triangles depict the cosine distance and compression ratio attained by RIQ with ANS compression. The orange line depicts the trend line. (b) MOBO optimization process. Interestingly, MOBO converges in the last few iterations to $\times 12$ compression, with a maximum of $\times 12.61$. RIQ, on the other hand, reaches practically the same compression ratio in a few seconds.

Theorems & Definitions (14)

Lemma 4.1
Corollary 4.2
Proposition 4.3
Proposition 4.4
Remark 4.5
Proposition 4.6
Theorem 4.7
proof
proof
proof
...and 4 more
