Instance-Aware Group Quantization for Vision Transformers

Jaehyeon Moon, Dohyung Kim, Junyong Cheon, Bumsub Ham

TL;DR

This work tackles the challenge of post-training quantization for Vision Transformers, where per-channel activation distributions and token-wise attentions vary significantly across inputs, undermining traditional layer-wide quantization. The authors introduce IGQ-ViT, which dynamically partitions activation channels and softmax attentions into instance-aware groups and applies separate quantizers per group, coupled with a group-size allocation strategy under a bit-operation budget. The approach yields state-of-the-art PTQ performance across ViT variants on ImageNet, COCO, and DETR benchmarks, achieving near full-precision accuracy with modest group counts and limited calibration data. Overall, IGQ-ViT enables efficient, accurate quantization of ViTs for deployment on resource-constrained devices, addressing a critical gap in transformer quantization.

Abstract

Post-training quantization (PTQ) is an efficient model compression technique that quantizes a pretrained full-precision model using only a small calibration set of unlabeled samples, without retraining. PTQ methods for convolutional neural networks (CNNs) provide quantization results comparable to full-precision counterparts. Directly applying them to vision transformers (ViTs), however, incurs severe performance degradation, mainly due to the architectural differences between CNNs and ViTs. In particular, the distribution of activations for each channel varies drastically according to input instances, making PTQ methods for CNNs inappropriate for ViTs. To address this, we introduce instance-aware group quantization for ViTs (IGQ-ViT). Specifically, we propose to split the channels of activation maps into multiple groups dynamically for each input instance, such that activations within each group share similar statistical properties. We also extend our scheme to quantize softmax attentions across tokens. In addition, the number of groups for each layer is adjusted to minimize the discrepancies between predictions from quantized and full-precision models, under a bit-operation (BOP) constraint. We show extensive experimental results on image classification, object detection, and instance segmentation, with various transformer architectures, demonstrating the effectiveness of our approach.
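
The idea of per-instance channel grouping described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the grouping rule used here (sorting channels by dynamic range and splitting them evenly), the min/max uniform quantizer, and the names `uniform_quantize` and `instance_group_quantize` are all assumptions made for illustration, and the synthetic activations merely mimic channels with very different scales.

```python
import numpy as np

def uniform_quantize(x, n_bits):
    """Asymmetric uniform fake-quantization using the min/max of x."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def instance_group_quantize(act, n_groups, n_bits):
    """Quantize one instance's activations (tokens x channels) group by group.

    Assumption: channels are grouped by sorting their dynamic ranges, a
    stand-in for the paper's actual group-assignment rule.
    """
    ranges = act.max(axis=0) - act.min(axis=0)   # per-channel dynamic range
    order = np.argsort(ranges)                   # channels with similar ranges end up adjacent
    out = np.empty_like(act)
    for idx in np.array_split(order, n_groups):  # one quantizer per group
        out[:, idx] = uniform_quantize(act[:, idx], n_bits)
    return out

# Synthetic instance: 197 tokens, 64 channels with widely varying scales.
rng = np.random.default_rng(0)
scales = rng.uniform(0.1, 10.0, size=64)
act = rng.normal(size=(197, 64)) * scales

layer_err = np.mean((act - uniform_quantize(act, 4)) ** 2)
group_err = np.mean((act - instance_group_quantize(act, 8, 4)) ** 2)
print(f"layer-wise MSE: {layer_err:.4f}, grouped MSE: {group_err:.4f}")
```

Because the grouping is recomputed from the statistics of each input, two instances can assign the same channel to different groups, which is the property that distinguishes this scheme from a fixed split of consecutive channels.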

Paper Structure (27 sections, 11 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: Visual comparison of group quantization and IGQ-ViT. (a) Conventional group quantization techniques [dai2021vs, shen2020q] divide consecutive channels uniformly into a number of groups without considering their dynamic ranges. The distribution of activations in each group varies significantly for individual input instances. (b) To alleviate this problem, IGQ-ViT proposes an instance-aware grouping technique that splits the channels of activation maps and softmax attentions across tokens dynamically for each input instance at runtime.
  • Figure 2: (a) Plots of standard deviations of activations across channels for DeiT-S [touvron2021training]; (b-c) Boxplots of activation values across different input instances for a particular channel of ResNet-50 [he2016deep] and DeiT-S, respectively. We use ImageNet [deng2009imagenet] for the visualizations. We observe a significant scale variation across channels, and the activation ranges for each channel change drastically among different samples for ViTs, in contrast to CNNs.
  • Figure 3: Distributions of softmax attentions across tokens. We can see that the distributions differ significantly across tokens. Our approach handles this issue by splitting the rows of softmax attentions into several groups and applying separate quantizers for each group, such that the attentions assigned to each group share similar statistical properties.
  • Figure 4: Comparisons of dynamic ranges of activation values across channels, chosen from different layers of ViT-S [dosovitskiy2020image]. $\sigma_{range}$ is the standard deviation of the dynamic ranges of channels for each layer. We can see that the degree of scale variation across channels varies according to the layer, suggesting that the number of groups should be adjusted for each layer.
  • Figure 5: Top-1 validation accuracies on ImageNet [deng2009imagenet] w.r.t. group sizes for linear operations ($G_1$, left) and for softmax attentions ($G_2$, right). We set either $G_1$ or $G_2$ to 8, while varying the other to compute the accuracies. We report the quantization results of ViT-S [dosovitskiy2020image], Swin-T [liu2021swin], and DeiT-B [touvron2021training] under a 4/4-bit setting, with a fixed group size across different layers. We visualize the upper bounds with horizontal stripes of corresponding colors.
  • ...and 4 more figures
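
The statistic $\sigma_{range}$ from the Figure 4 caption, the standard deviation of per-channel dynamic ranges within a layer, can be sketched as follows. The caption does not say how instances are aggregated, so this sketch assumes a single instance with activations shaped (tokens, channels) and takes the dynamic range of a channel to be its max minus its min; the function name `sigma_range` and the synthetic layers are hypothetical.

```python
import numpy as np

def sigma_range(act):
    """Std of per-channel dynamic ranges for activations shaped (tokens, channels)."""
    dyn = act.max(axis=0) - act.min(axis=0)  # dynamic range of each channel
    return float(dyn.std())

rng = np.random.default_rng(0)
# Hypothetical layer whose channels all share one scale.
uniform_layer = rng.normal(size=(197, 64))
# Hypothetical layer whose channel scales span two orders of magnitude.
varied_layer = rng.normal(size=(197, 64)) * rng.uniform(0.1, 10.0, size=64)

print(sigma_range(uniform_layer), sigma_range(varied_layer))
```

Under these assumptions, a layer with a larger $\sigma_{range}$ would be a natural candidate for more groups, which matches the motivation the caption gives for adjusting the number of groups per layer.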