Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity

Navin Ranjan, Andreas Savakis

TL;DR

Mix-QViT addresses the challenge of efficiently quantizing vision transformers by integrating explainability-driven layer importance with quantization sensitivity to guide per-layer bit allocation under resource constraints via an Integer Quadratic Program. It couples post-training quantization (PTQ) enhancements, notably clipped channel-wise reparameterization for post-LayerNorm activations, with log-based quantization for power-law activations to improve stability and accuracy. The framework yields substantial PTQ gains over state-of-the-art methods at 3–6 bits across ViT, DeiT, and Swin, and enables near full-precision performance in quantization-aware training (QAT) at 2-bit mixed precision. Together, these contributions provide an interpretable, scalable mixed-precision quantization (MPQ) approach that makes deploying vision transformers on resource-constrained platforms more practical across classification, detection, and segmentation tasks.
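
To illustrate the idea of clipping per-channel ranges before quantizing post-LayerNorm activations, here is a minimal pure-Python sketch of per-channel uniform quantization with percentile clipping. The percentile rule, the function names, and the parameter `p` are assumptions made for illustration; the summary does not state the paper's exact clipping criterion, and the reparameterization step is not shown.

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list, with p in [0, 1]."""
    idx = round(p * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def quantize_channel(values, bits, lo, hi):
    """Clamp values to [lo, hi], then quantize uniformly to 2**bits levels."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)       # suppress extreme outliers
        q = round((clipped - lo) / scale)   # integer code in [0, levels]
        out.append(lo + q * scale)          # dequantized value
    return out


def clipped_channelwise_quantize(acts, bits, p=0.99):
    """acts: list of tokens, each a list of per-channel activation values.

    Each channel gets its own clipping range (hypothetical percentile rule),
    so one channel's outliers do not stretch the range of the others.
    """
    n_tokens, n_channels = len(acts), len(acts[0])
    columns = []
    for c in range(n_channels):
        col = [tok[c] for tok in acts]
        s = sorted(col)
        lo, hi = percentile(s, 1.0 - p), percentile(s, p)
        columns.append(quantize_channel(col, bits, lo, hi))
    # Transpose back to token-major layout.
    return [[columns[c][t] for c in range(n_channels)] for t in range(n_tokens)]
```

Because each channel is clipped independently, a single extreme value (such as those marked in Figure 1) is clamped to that channel's upper bound instead of widening the quantization step for every other value.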

Abstract

In this paper, we propose Mix-QViT, an explainability-driven mixed-precision quantization (MPQ) framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies how much each layer contributes to the final classification, and quantization sensitivity, determined by evaluating the performance impact of quantizing each layer at various precision levels while keeping the other layers at a baseline precision. Additionally, for post-training quantization (PTQ), we introduce a clipped channel-wise quantization method designed to reduce the effects of extreme outliers in post-LayerNorm activations by removing severe inter-channel variations. We validate our approach by applying Mix-QViT to ViT, DeiT, and Swin Transformer models across multiple datasets. Our experimental results for PTQ demonstrate that both fixed-bit and mixed-bit methods outperform existing techniques, particularly at 3-bit, 4-bit, and 6-bit precision. Furthermore, in quantization-aware training, Mix-QViT achieves superior performance with 2-bit mixed-precision.
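
As a rough illustration of how an importance score and a sensitivity score could jointly drive bit allocation under a budget, the sketch below brute-forces a toy problem. It is not the paper's Integer Quadratic Program: the linear cost (importance times sensitivity), the average-bit budget, the function name, and all numbers are assumptions chosen only to make the example concrete.

```python
from itertools import product


def allocate_bits(importance, sensitivity, choices, avg_bit_budget):
    """Exhaustively search per-layer bit-widths for a small model.

    importance[i]      -- hypothetical importance score of layer i
    sensitivity[i][b]  -- hypothetical sensitivity of layer i at bit-width b
    choices            -- candidate bit-widths, e.g. (3, 4, 5)
    avg_bit_budget     -- maximum allowed average bit-width across layers
    """
    n = len(importance)
    best_cost, best_alloc = float("inf"), None
    for alloc in product(choices, repeat=n):
        if sum(alloc) > avg_bit_budget * n:
            continue  # violates the resource constraint
        # Toy objective: penalize low precision on important, sensitive layers.
        cost = sum(importance[i] * sensitivity[i][b] for i, b in enumerate(alloc))
        if cost < best_cost:
            best_cost, best_alloc = cost, alloc
    return best_alloc, best_cost


# Made-up scores for three layers (not taken from the paper).
importance = [0.5, 0.3, 0.2]
sensitivity = [
    {3: 4.0, 4: 2.0, 5: 0.5},
    {3: 2.0, 4: 1.0, 5: 0.5},
    {3: 1.0, 4: 0.8, 5: 0.6},
]
alloc, cost = allocate_bits(importance, sensitivity, (3, 4, 5), avg_bit_budget=4)
print(alloc)  # (5, 4, 3): the most important layer receives the most bits
```

Exhaustive search is only feasible for a handful of layers; a real model would need an integer programming solver, which is why the framework formulates the allocation as an optimization problem.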

Paper Structure (18 sections, 20 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Box plot of the post-LayerNorm activations for channels 300 to 350 in DeiT-S: (a) activations from the blocks.0.attn.qkv layer, and (b) activations from the blocks.7.mlp.fc1 layer. Extreme values are marked with black circles.
  • Figure 2: Overview of the Mix-QViT framework. (a) Layer Importance Score ($\Omega$), calculated offline on a small sample of 256 images from the ImageNet1K validation dataset, based on the Layer-wise Relevance Propagation (LRP) method. (b) Quantization Sensitivity Score ($\Lambda$), calculated offline on 256 images from the ImageNet1K validation dataset. Here, performance changes are recorded between two models: one where all layers are quantized at the baseline precision, and another where the target layer in each transformer block is quantized at a different precision. (c) Mixed-precision bit allocation strategy, which uses the layer importance score and the quantization sensitivity score to generate an optimal mixed-bit allocation under model constraints.
  • Figure 3: Layer Importance Score of DeiT-Small. Each value gives the layer's relative importance to the model's classification.
  • Figure 4: Quantization sensitivity analysis of DeiT-Small transformer blocks using CLR-RQViT at different bit-widths. Each value shows the accuracy change when a target layer is quantized at a specific precision, with all other layers fixed at 4-bit precision, compared to the fully 4-bit quantized model.
  • Figure 5: Quantization Sensitivity Score of DeiT-Small based on the (a) RQViT method and (b) AQViT method, respectively. Each value represents the layer's robustness to quantization as a percentage, with higher values indicating greater sensitivity to quantization error.
  • ...and 2 more figures
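
The measurement described in the captions of Figures 2(b) and 4 (quantize one target layer at a candidate bit-width, keep every other layer at the baseline, and record the accuracy change relative to the all-baseline model) can be sketched as follows. The `evaluate` callable, the layer names, and the toy accuracy model are hypothetical placeholders; in practice `evaluate` would run the quantized network on the calibration images.

```python
def sensitivity_table(layers, candidate_bits, baseline_bits, evaluate):
    """Return {(layer, bits): accuracy change vs. the all-baseline model}.

    evaluate -- hypothetical callable mapping a {layer: bits} config to accuracy
    """
    baseline_cfg = {name: baseline_bits for name in layers}
    baseline_acc = evaluate(baseline_cfg)
    table = {}
    for name in layers:
        for bits in candidate_bits:
            cfg = dict(baseline_cfg)
            cfg[name] = bits  # only the target layer deviates from baseline
            table[(name, bits)] = evaluate(cfg) - baseline_acc
    return table


# Toy stand-in for a real evaluation loop (made-up numbers, not paper results):
# accuracy drops linearly as each layer loses bits, scaled by a per-layer weight.
toy_weights = {"attn.qkv": 2.0, "mlp.fc1": 0.5}


def toy_evaluate(cfg):
    return 80.0 - sum(toy_weights[n] * (8 - b) for n, b in cfg.items())


table = sensitivity_table(list(toy_weights), (3, 4, 6), 4, toy_evaluate)
```

In this toy setup, a layer with a larger weight shows a larger accuracy swing when its precision changes, which is the kind of signal a sensitivity score is meant to capture.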