LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang

TL;DR

Quantizing Vision Transformers efficiently is challenging due to differential sensitivity across layers and between module types. LampQ addresses this by adopting layer-wise mixed-precision with a type-aware Fisher-based metric, $\boldsymbol{\Omega}_i = \boldsymbol{\alpha}_{t}\,\text{tr}(\boldsymbol{F}_i)$, and an ILP-based initialization followed by iterative bit updates to reflect quantization feedback. Across image classification, object detection, and zero-shot quantization, LampQ achieves state-of-the-art accuracy and significant speedups over prior PTQ methods, while remaining compatible with existing baselines like AdaLog. The approach offers a practical pathway to deploy accurate ViT quantization on resource-constrained devices and opens avenues for extension to other vision and multimodal models.
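
As a rough illustration of the metric $\boldsymbol{\Omega}_i = \boldsymbol{\alpha}_{t}\,\text{tr}(\boldsymbol{F}_i)$, the sketch below uses the standard identity that the trace of the empirical Fisher matrix equals the mean squared gradient norm over samples. How LampQ chooses $\boldsymbol{\alpha}_{t}$ is not described in this summary, so the per-type scales are passed in as given; the function names, layer names, and numbers are hypothetical.

```python
from typing import Dict, List


def fisher_trace(per_sample_grads: List[List[float]]) -> float:
    """Trace of the empirical Fisher F = mean(g g^T) over samples.

    tr(g g^T) = ||g||^2, so the trace is the mean squared gradient norm.
    """
    n = len(per_sample_grads)
    return sum(sum(g * g for g in grad) for grad in per_sample_grads) / n


def type_aware_sensitivity(
    layer_grads: Dict[str, List[List[float]]],
    layer_types: Dict[str, str],
    alpha: Dict[str, float],
) -> Dict[str, float]:
    """Omega_i = alpha_t * tr(F_i), where t is the type of layer i.

    `alpha` is an assumed, externally supplied per-type scale.
    """
    return {
        name: alpha[layer_types[name]] * fisher_trace(grads)
        for name, grads in layer_grads.items()
    }


# Toy per-sample gradients for two layers (hypothetical values).
layer_grads = {
    "block0.qkv": [[1.0, 2.0], [3.0, 0.0]],
    "block0.fc1": [[0.5, 0.5], [0.5, 0.5]],
}
layer_types = {"block0.qkv": "qkv", "block0.fc1": "fc1"}
alpha = {"qkv": 1.0, "fc1": 4.0}  # assumed type-specific scales

omega = type_aware_sensitivity(layer_grads, layer_types, alpha)
print(omega)  # {'block0.qkv': 7.0, 'block0.fc1': 2.0}
```

The type-specific scale lets layers of different types be compared on one axis even when their raw Fisher traces differ in magnitude.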

Abstract

How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ achieves state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
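
The bit-allocation step can be pictured as a small integer program: pick one bit-width per layer to minimize sensitivity-weighted error under a model-size budget. The sketch below is not LampQ's formulation. It assumes a proxy objective $\sum_i \Omega_i \cdot 4^{-b_i}$ (uniform quantization noise power shrinking by a factor of 4 per extra bit) and solves the program by exhaustive search instead of an ILP solver, which is feasible only for a handful of layers. It also omits the iterative bit update. All names and numbers are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence


def allocate_bits(
    omega: Dict[str, float],
    num_params: Dict[str, int],
    candidate_bits: Sequence[int],
    avg_bit_budget: float,
) -> Dict[str, int]:
    """Choose one bit-width per layer under an average-bit budget.

    Minimizes sum_i omega[i] * 4**(-b_i) subject to
    sum_i num_params[i] * b_i <= avg_bit_budget * sum_i num_params[i].
    Exhaustive search stands in for an ILP solver; the cost is
    len(candidate_bits) ** len(omega) evaluations.
    """
    names = list(omega)
    budget = avg_bit_budget * sum(num_params[n] for n in names)
    best_cost, best = float("inf"), None
    for bits in product(candidate_bits, repeat=len(names)):
        size = sum(num_params[n] * b for n, b in zip(names, bits))
        if size > budget:
            continue  # violates the size constraint
        cost = sum(omega[n] * 4.0 ** (-b) for n, b in zip(names, bits))
        if cost < best_cost:
            best_cost, best = cost, dict(zip(names, bits))
    if best is None:
        raise ValueError("no assignment satisfies the budget")
    return best


# Three layers of equal size with very different sensitivities.
omega = {"qkv": 100.0, "fc1": 1.0, "fc2": 0.01}
num_params = {"qkv": 10, "fc1": 10, "fc2": 10}

assignment = allocate_bits(omega, num_params, candidate_bits=(2, 4, 8),
                           avg_bit_budget=5.0)
print(assignment)  # {'qkv': 8, 'fc1': 4, 'fc2': 2}
```

With equal layer sizes, the most sensitive layer receives the most bits and the least sensitive the fewest, which is the behavior metric-based MPQ aims for. A real model with dozens of layers would need an actual ILP solver in place of the enumeration.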

Paper Structure

This paper contains 35 sections, 3 theorems, 11 equations, 8 figures, 10 tables, 1 algorithm.

Key Result

Lemma 1

Assume that for all layers $l_i$ with weight vector $\mathbf{W}_i$, the gradient $\mathbf{g}_i=\mathbf{0}$, the Hessian $\mathbf{H}_i$ is positive semi-definite for the target loss $\mathcal{L}$, and $\exists \alpha \in \mathbb{R}, \Delta_{\mathbf{W}_i}=\widehat{\mathbf{W}}_i-\mathbf{W}_i=\dots$

Figures (8)

  • Figure 1: Accuracy when quantizing a single component of the DeiT-S model to 1-bit following AdaLog while keeping the others unchanged. Sensitivity varies significantly across (a) blocks and modules, and (b) layers.
  • Figure 2: Illustration of a ViT model with $N$ blocks. Each block consists of two modules: MSA (red) and MLP (blue), and four layers: qkv, proj, fc1, and fc2 (purple).
  • Figure 3: Illustration of how metric-based MPQ works. Such methods first partition model parameters into groups and measure each group's sensitivity. More bits are then allocated to sensitive groups, reducing model size while maintaining performance.
  • Figure 4: Overall architecture of LampQ. Our main ideas are I1) layer-wise mixed-precision quantization, I2) type-aware Fisher-based metric, and I3) iterative bit update.
  • Figure 5: Comparison of metric values between (a) VT-PTQ and (b) LampQ for a DeiT-S model.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Lemma 1: Layer importance and Hessian trace
  • proof
  • Lemma 2: Hessian and Fisher information matrices
  • proof
  • Lemma 3: Expected ratio of reconstruction losses
  • proof
  • proof
  • proof