AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers

Runqing Jiang, Ye Zhang, Longguang Wang, Pengpeng Yu, Yulan Guo

TL;DR

Vision Transformers offer strong accuracy but impose heavy compute and memory demands. AIQViT addresses this with an architecture-informed low-rank compensation mechanism that mitigates weight-quantization loss and a Dynamic Focusing Quantizer (DFQ) that handles unbalanced post-Softmax activations without log-based schemes. The method uses differentiable architecture search to choose the rank of the low-rank weights and a curriculum-based optimization to stabilize training, yielding robust performance across image classification, object detection, instance segmentation, and 3D point cloud tasks at ultra-low bit-widths. Across five vision tasks and multiple ViT variants, AIQViT consistently outperforms state-of-the-art PTQ methods and approaches full-precision (FP) accuracy in several settings, which makes deploying ViTs on resource-constrained devices more practical. The work thus advances PTQ for ViTs by combining architecture-aware compensation, interval-focused quantization, and efficient optimization strategies.

Abstract

Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily focus on crafting quantizers to deal with the peculiar activation distributions characteristic of ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To address these issues, this paper proposes a PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low-rank compensation mechanism, wherein learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.
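
The idea of compensating weight quantization with low-rank weights can be illustrated with a small NumPy sketch. It is not the paper's procedure: AIQViT learns the low-rank weights by minimizing a block reconstruction loss and selects the rank through architecture search, whereas this sketch substitutes a closed-form truncated SVD of the weight residual with a hand-picked rank. The function names, the 64×64 random matrix, the 3-bit setting, and the rank of 8 are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_factors(residual, rank):
    """Best rank-`rank` factors (A, B) of the residual via truncated SVD.

    Assumption: SVD stands in for the learned low-rank weights of the paper.
    """
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    b = vt[:rank, :]             # shape (rank, in_features)
    return a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))    # stand-in for a full-precision weight matrix
w_q = quantize_uniform(w, bits=3)

# Low-rank term that approximates what quantization removed from the weights.
a, b = low_rank_factors(w - w_q, rank=8)

err_plain = np.linalg.norm(w - w_q)
err_comp = np.linalg.norm(w - (w_q + a @ b))
print(f"weight error without compensation: {err_plain:.4f}")
print(f"weight error with rank-8 compensation: {err_comp:.4f}")
```

The compensated error is never larger than the uncompensated one, because the truncated SVD removes the largest singular components of the residual. A learned compensation, as described in the abstract, targets the block output rather than the weight residual itself.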

Paper Structure

This paper contains 20 sections, 17 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Performance of different approaches on five tasks (image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation) under W4/A4 quantization. Best viewed in color.
  • Figure 2: Overview of architecture-informed low-rank compensation. First, we employ differentiable architecture search to identify the optimal rank $r$ from a candidate architecture set. Subsequently, we freeze the original weights and optimize the selected low-rank weights by minimizing the reconstruction loss between the full-precision block and the quantized block. During this process, the training set is incrementally expanded in a curriculum learning (CL) manner.
  • Figure 3: (a) Histogram of the first MHSA module’s post-Softmax activations in DeiT-T. (b) $\log_2$ quantizer (in blue) and DFQ (in orange). (c) Results on ImageNet with W3/A3 quantization. "DFQ(fixed)" means all the layers use the same interval. Best viewed in color.
  • Figure 4: Visualization of the learned intervals and the influence of calibration data size. (a) Learned intervals for DeiT-T with W4/A4 quantization. (b) Effect of the number of calibration data points on DeiT-T quantization on ImageNet.
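
The contrast drawn in Figure 3 between a log2 quantizer and an interval-focused quantizer can be sketched in plain Python. This is an illustration under stated assumptions, not the DFQ defined in the paper: the interval bounds `lo` and `hi` are set by hand here rather than selected dynamically, values outside the interval are assumed to be clipped to its bounds, and both function names are hypothetical.

```python
import math

def log2_quantize(x, bits):
    """Log2 quantizer: store round(-log2(x)) as an integer, dequantize as 2**-q."""
    levels = 2 ** bits - 1
    if x <= 0.0:
        return 2.0 ** -levels  # map non-positive input to the smallest level
    q = min(max(round(-math.log2(x)), 0), levels)
    return 2.0 ** -q

def focused_quantize(x, bits, lo, hi):
    """Uniform quantizer restricted to [lo, hi].

    Assumption: inputs outside the interval are clipped to its bounds.
    """
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    q = round((min(max(x, lo), hi) - lo) / step)
    return lo + q * step

bits = 3

# Count how many of the log2 quantizer's levels fall close to zero.
log2_levels = [2.0 ** -q for q in range(2 ** bits)]
near_zero = sum(1 for v in log2_levels if v < 0.1)
print(f"log2 levels below 0.1: {near_zero} of {len(log2_levels)}")

# Compare absolute errors on a few softmax-like values.
for x in (0.02, 0.3, 0.7):
    e_log = abs(x - log2_quantize(x, bits))
    e_foc = abs(x - focused_quantize(x, bits, lo=0.0, hi=1.0))
    print(f"x={x}: log2 error {e_log:.4f}, focused error {e_foc:.4f}")
```

With 3 bits, half of the log2 levels lie below 0.1, so larger attention values such as 0.7 are reproduced coarsely, while very small values are reproduced more finely than a uniform grid allows. Narrowing `[lo, hi]` trades clipping outside the interval for a smaller step inside it, which is the trade-off an interval-selecting quantizer has to balance.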