Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity
Navin Ranjan, Andreas Savakis
TL;DR
Mix-QViT addresses the challenge of efficiently quantizing vision transformers by combining explainability-driven layer importance with quantization sensitivity to guide per-layer bit allocation under resource constraints, formulated as an Integer Quadratic Program. For post-training quantization (PTQ), it pairs clipped channel-wise reparameterization for post-LayerNorm activations with log-based quantization for power-law activations to improve stability and accuracy. The framework yields substantial PTQ gains over state-of-the-art methods at 3–6 bits across ViT, DeiT, and Swin, and enables near full-precision performance in quantization-aware training (QAT) at 2-bit mixed precision. Together, these contributions provide an interpretable, scalable mixed-precision quantization (MPQ) approach that improves the practicality of deploying vision transformers on resource-constrained platforms across classification, detection, and segmentation tasks.
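To make the log-based quantization mentioned above concrete, here is a minimal sketch of a standard log2 quantizer for values in (0, 1], such as post-softmax attention scores. The function names and the handling of non-positive inputs are assumptions for illustration; the paper's exact quantizer may differ.

```python
import math


def log2_quantize(x, bits):
    """Return the integer code q = round(-log2(x)), clipped to [0, 2**bits - 1].

    Assumption: x is expected in (0, 1]. Non-positive inputs are mapped
    to the largest code, and inputs above 1 are clipped to code 0.
    """
    q_max = 2 ** bits - 1
    if x <= 0.0:
        return q_max
    q = round(-math.log2(x))
    return min(max(q, 0), q_max)


def log2_dequantize(q):
    """Reconstruct the value 2**(-q) from its integer code."""
    return 2.0 ** (-q)


# Small values keep fine resolution; large values share coarse levels.
codes = [log2_quantize(v, bits=4) for v in (1.0, 0.5, 0.25, 0.3, 1e-9)]
recon = [log2_dequantize(q) for q in codes]
```

The levels are spaced geometrically, so most codes cover the many small values that a power-law distribution produces, while the few large values fall on a handful of coarse levels.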
Abstract
In this paper, we propose Mix-QViT, an explainability-driven mixed-precision quantization (MPQ) framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies how much each layer contributes to the final classification, and quantization sensitivity, determined by evaluating the performance impact of quantizing each layer at various precision levels while keeping the other layers at a baseline precision. Additionally, for post-training quantization (PTQ), we introduce a clipped channel-wise quantization method designed to reduce the effects of extreme outliers in post-LayerNorm activations by removing severe inter-channel variations. We validate our approach by applying Mix-QViT to ViT, DeiT, and Swin Transformer models across multiple datasets. Our experimental results for PTQ demonstrate that both fixed-bit and mixed-bit methods outperform existing techniques, particularly at 3-bit, 4-bit, and 6-bit precision. Furthermore, in quantization-aware training (QAT), Mix-QViT achieves superior performance with 2-bit mixed precision.
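The bit-allocation idea can be illustrated with a toy example. The sketch below is not the paper's Integer Quadratic Program: it assumes a simplified cost (layer importance multiplied by the measured sensitivity at the chosen bit-width) and solves it by exhaustive search, which is only feasible for a handful of layers. The function name, the cost definition, and all numbers are hypothetical.

```python
from itertools import product


def allocate_bits(sizes, sensitivity, importance, choices, budget_bits):
    """Pick one bit-width per layer by brute force.

    sizes[i]          parameter count of layer i
    sensitivity[i][b] performance drop when only layer i is quantized to b bits
    importance[i]     relevance score of layer i (e.g. derived from LRP)
    choices           candidate bit-widths
    budget_bits       maximum total model size in bits

    Assumed cost (not the paper's objective): sum_i importance[i] * sensitivity[i][b_i].
    """
    best_cost, best_assignment = None, None
    for assignment in product(choices, repeat=len(sizes)):
        total_bits = sum(n * b for n, b in zip(sizes, assignment))
        if total_bits > budget_bits:
            continue  # violates the model-size constraint
        cost = sum(importance[i] * sensitivity[i][b]
                   for i, b in enumerate(assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


# Hypothetical three-layer model with an average budget of 4 bits per parameter.
sizes = [100, 100, 400]
importance = [6, 3, 1]
per_layer_drop = {2: 40, 4: 10, 8: 1}
sensitivity = [per_layer_drop, per_layer_drop, per_layer_drop]
budget = 4 * sum(sizes)

assignment, cost = allocate_bits(sizes, sensitivity, importance, (2, 4, 8), budget)
```

In this toy setup the search gives the two small, important layers 8 bits and pushes the large, less important layer down to 2 bits, which is the kind of trade-off a mixed-precision allocation is meant to find. A real model would need an integer programming solver rather than enumeration, since the search space grows exponentially with the number of layers.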
