
Mixed-Precision Quantization for Deep Vision Models with Integer Quadratic Programming

Zihao Deng, Sayeh Sharify, Xin Wang, Michael Orshansky

TL;DR

This work introduces CLADO, a mixed-precision quantization (MPQ) method that models cross-layer dependencies and so drops the independence assumption made by prior sensitivity-based approaches. CLADO uses a second-order expansion to decompose the quantization loss into layer-specific and cross-layer terms, estimates the cross-layer sensitivities in a Hessian-free, forward-only fashion, and casts the bit-width assignment as an Integer Quadratic Program (IQP) that can be solved efficiently. Across CNNs and Vision Transformers on ImageNet, CLADO achieves state-of-the-art mixed-precision performance, with substantial gains under tight model-size constraints and robust behavior when the sensitivity set varies. PSD-based smoothing further stabilizes the solutions, and the approach remains effective after quantization-aware fine-tuning. The result is a practical, scalable framework for choosing per-layer bit-widths that accounts for interactions between layers, which makes it relevant for deploying compressed vision models on resource-constrained hardware.

Abstract

Quantization is a widely used technique to compress neural networks. Assigning uniform bit-widths across all layers can result in significant accuracy degradation at low precision and inefficiency at high precision. Mixed-precision quantization (MPQ) addresses this by assigning varied bit-widths to layers, optimizing the accuracy-efficiency trade-off. Existing sensitivity-based methods for MPQ assume that quantization errors across layers are independent, which leads to suboptimal choices. We introduce CLADO, a practical sensitivity-based MPQ algorithm that captures cross-layer dependency of quantization error. CLADO approximates pairwise cross-layer errors using linear equations on a small data subset. Layerwise bit-widths are assigned by optimizing a new MPQ formulation based on cross-layer quantization errors using an Integer Quadratic Program. Experiments with CNN and vision transformer models on ImageNet demonstrate that CLADO achieves state-of-the-art mixed-precision quantization performance. Code repository available here: https://github.com/JamesTuna/CLADO_MPQ
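
To make the optimization step described in the abstract concrete, the following is a minimal sketch of a bit-width assignment posed as an integer quadratic program and solved by exhaustive search on a toy instance. It is not the authors' implementation: the function name `solve_mpq_bruteforce`, the one-hot encoding, the toy matrix `G_toy`, and the size model (parameter count times bit-width) are all assumptions made for illustration, and a real model would need a proper IQP solver rather than enumeration.

```python
from itertools import product

import numpy as np


def solve_mpq_bruteforce(G, layer_params, bit_choices, size_budget_bits):
    """Minimise alpha^T G alpha over one-hot bit-width choices (toy sketch).

    G                : (L*B, L*B) symmetric matrix; index layer*B + b refers to
                       "layer `layer` quantized to bit_choices[b]" (assumed layout).
    layer_params     : number of weights in each of the L layers.
    bit_choices      : the B candidate bit-widths.
    size_budget_bits : upper bound on sum(params * bits) over all layers.
    """
    num_layers, num_bits = len(layer_params), len(bit_choices)
    best_cost, best_assignment = float("inf"), None
    # Enumerate every assignment of one bit-width index per layer.
    for choice in product(range(num_bits), repeat=num_layers):
        size = sum(p * bit_choices[c] for p, c in zip(layer_params, choice))
        if size > size_budget_bits:
            continue  # violates the model-size constraint
        alpha = np.zeros(num_layers * num_bits)
        for layer, c in enumerate(choice):
            alpha[layer * num_bits + c] = 1.0  # one-hot selection per layer
        cost = float(alpha @ G @ alpha)
        if cost < best_cost:
            best_cost = cost
            best_assignment = [bit_choices[c] for c in choice]
    return best_assignment, best_cost


# Toy instance (made-up numbers): 2 layers, candidate bit-widths 2 and 8.
# Indices: 0 = layer0@2bit, 1 = layer0@8bit, 2 = layer1@2bit, 3 = layer1@8bit.
G_toy = np.zeros((4, 4))
G_toy[0, 0] = 1.0                 # layer-specific term, layer 0 at 2 bits
G_toy[2, 2] = 0.5                 # layer-specific term, layer 1 at 2 bits
G_toy[0, 2] = G_toy[2, 0] = -0.6  # cross-layer term between the two 2-bit choices

assignment, cost = solve_mpq_bruteforce(G_toy, [10, 10], [2, 8], 100)

# Same problem with the cross-layer terms dropped (independence assumption).
G_diag = np.diag(np.diag(G_toy))
assignment_indep, cost_indep = solve_mpq_bruteforce(G_diag, [10, 10], [2, 8], 100)
```

In this toy instance the two objectives pick different assignments under the same budget, which is the kind of discrepancy the abstract attributes to ignoring cross-layer dependency. The enumeration grows exponentially with the number of layers, so it only serves to show the structure of the objective and constraint.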

Paper Structure

This paper contains 14 sections, 13 equations, 12 figures, 2 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Sensitivity matrix of ResNet models. Entry $(i,i)$: increase in loss caused by quantizing a single layer $i$. Entry $(i,j)$: extra increase in loss when quantizing a pair of layers $(i,j)$, compared to individually quantizing them.
  • Figure 2: CNNs and ViT on the ImageNet dataset
  • Figure 3: MPQ results (QAT): QAT fine-tuning based on CLADO outperforms fine-tuning based on other MPQ methods. (Range of model sizes is chosen to be close to 3-bit UPQ. With higher model size, all MPQ algorithms tend to recover FP32 performance.)
  • Figure 4: MPQ performance vs. sample size. Data shows median performance across 24 random sensitivity sets. Colored regions cover the upper and lower performance quartiles.
  • Figure 5: Bit-width assignments to ResNet-50 along with the layer index-layer name mappings. Model size constraint is set to be 11.18MB, corresponding to 4-bit UPQ.
  • ...and 7 more figures
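
The Figure 1 caption defines the sensitivity matrix purely in terms of loss differences, so it can be illustrated with forward-only loss evaluations. The sketch below follows only those caption definitions; the paper's actual estimator may scale or batch the measurements differently. The callback `loss_fn` (which returns the loss when a given set of layers is quantized), the function name `sensitivity_matrix`, and the toy loss are hypothetical.

```python
import numpy as np


def sensitivity_matrix(loss_fn, num_layers):
    """Build the matrix described in the Figure 1 caption (illustrative sketch).

    loss_fn(quantized) is a hypothetical callback returning the loss when the
    layers in the frozenset `quantized` are quantized and all others are not.
    """
    base = loss_fn(frozenset())  # loss of the unquantized model
    # Entry (i, i): increase in loss caused by quantizing layer i alone.
    single = [loss_fn(frozenset({i})) - base for i in range(num_layers)]
    S = np.zeros((num_layers, num_layers))
    for i in range(num_layers):
        S[i, i] = single[i]
        for j in range(i + 1, num_layers):
            # Entry (i, j): extra increase when quantizing both layers,
            # compared with the sum of the two individual increases.
            pair = loss_fn(frozenset({i, j})) - base
            S[i, j] = S[j, i] = pair - single[i] - single[j]
    return S


# Toy loss with made-up numbers: additive per-layer increases plus one
# interaction between layers 0 and 1.
delta = [0.2, 0.1, 0.05]
interaction = {(0, 1): 0.03}


def toy_loss(quantized):
    loss = 1.0 + sum(delta[i] for i in quantized)
    loss += sum(v for (a, b), v in interaction.items()
                if a in quantized and b in quantized)
    return loss


S_toy = sensitivity_matrix(toy_loss, 3)
```

For this toy loss the off-diagonal entry for layers 0 and 1 recovers the interaction term, while pairs without interaction get an entry of zero, matching the caption's reading of the off-diagonal entries as the loss increase not explained by the individual layers.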