AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

Zhuguanyu Wu, Jiaxin Chen, Hanwen Zhong, Di Huang, Yunhong Wang

TL;DR

This paper addresses post-training quantization (PTQ) of Vision Transformers, where the power-law-like distributions of post-Softmax and post-GELU activations degrade low-bit accuracy. It introduces AdaLog, a non-uniform quantizer with an adaptive logarithm base, and pairs it with bias reparameterization so that the same quantizer applies to both post-Softmax and post-GELU activations. A Fast Progressive Combining Search (FPCS) efficiently finds the logarithm bases and scaling factors. The method reports higher accuracy than state-of-the-art PTQ approaches across ViT variants on ImageNet and COCO, particularly at 4-bit and 3-bit quantization, while keeping inference hardware-friendly through table lookups.

Abstract

Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community. Despite its high accuracy, deploying it in real applications raises critical challenges, including high computational cost and inference latency. Recently, post-training quantization (PTQ) has emerged as a promising way to enhance ViT's efficiency. Nevertheless, existing PTQ approaches for ViT suffer from inflexible quantization of the post-Softmax and post-GELU activations, which follow power-law-like distributions. To address these issues, we propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer. It optimizes the logarithmic base to accommodate the power-law-like distribution of activations, while simultaneously allowing for hardware-friendly quantization and de-quantization. By employing bias reparameterization, the AdaLog quantizer is applicable to both the post-Softmax and post-GELU activations. Moreover, we develop an efficient Fast Progressive Combining Search (FPCS) strategy to determine the optimal logarithm base for AdaLog, as well as the scaling factors and zero points for the uniform quantizers. Extensive experimental results on public benchmarks demonstrate the effectiveness of our approach for various ViT-based architectures and vision tasks, including classification, object detection, and instance segmentation. Code is available at https://github.com/GoatWu/AdaLog.
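
To make the idea of a logarithmic quantizer with a tunable base concrete, the following is a minimal sketch of the generic textbook form: each positive activation is mapped to an integer exponent of the base and reconstructed as a power of that base. It is not the authors' implementation. The function names, the `scale` parameter, and the floating-point arithmetic are illustrative assumptions, and the table-lookup integer path described in the abstract is not reproduced here.

```python
import numpy as np

def log_quantize(x, base, bits, scale=1.0):
    """Map positive activations to integer exponents of `base` (hypothetical helper)."""
    levels = 2 ** bits - 1
    # Guard against log(0); such values clip to the last level anyway.
    ratio = np.maximum(x / scale, np.finfo(np.float64).tiny)
    # Change of base: log_base(r) = ln(r) / ln(base).
    q = np.round(-np.log(ratio) / np.log(base))
    return np.clip(q, 0, levels).astype(np.int64)

def log_dequantize(q, base, scale=1.0):
    """Reconstruct activations as scale * base^(-q) (hypothetical helper)."""
    return scale * np.power(float(base), -q.astype(np.float64))

x = np.array([1.0, 0.5, 0.25, 0.1, 0.01])
q = log_quantize(x, base=2.0, bits=4)
x_hat = log_dequantize(q, base=2.0)
print(q)      # [0 1 2 3 7]
print(x_hat)  # values 1, 0.5, 0.25, 0.125, 0.0078125
```

With `base` exposed as a parameter, a smaller base spaces the levels more finely near the top of the range, while a larger base reaches smaller values with the same number of bits. That trade-off is what an adaptive base is meant to tune.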

Paper Structure

This paper contains 18 sections, 12 equations, 4 figures, 8 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Histogram of post-Softmax activations. (a)-(b): In 4-bit quantization, the $\log\sqrt{2}$ quantizer allocates more bits to the relatively important large values compared to the $\log 2$ quantizer, thus reaching higher accuracy. (c)-(d): In 3-bit quantization, the $\log\sqrt{2}$ quantizer quantizes the majority of values to 0, leading to significant degradation.
  • Figure 2: Illustration of the framework of our method. The AdaLog quantizer is employed for quantizing the post-Softmax and post-GELU activations, where bias reparameterization is specifically integrated to extend AdaLog to the post-GELU layers. The Fast Progressive Combining Search (FPCS) strategy searches for the optimal scaling factors and logarithm base of AdaLog, as well as the scaling factors and zero points of the uniform quantizers.
  • Figure 3: (a) shows the flowchart for linearly quantized data. (b) shows the flowchart of the $\log\sqrt{2}$ quantizer of RepQ-ViT, which fails to avoid the floating-point multiplication and is therefore not hardware-friendly. (c) shows the flowchart of the proposed AdaLog method, which takes only two extra table lookup operations and one bit-shift operation compared to the standard linear integer multiplication, making it efficient and hardware-friendly.
  • Figure 4: Illustration of the distribution of post-GELU activations. (a) and (b) are the distributions of post-GELU activation values from different layers of ViT-Base. Although both follow power-law-like distributions, their value ranges differ substantially, showing the necessity of adaptive logarithm bases.
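
The bit-width effect described in the Figure 1 caption can be checked with simple arithmetic. The sketch below assumes the generic log-quantizer form in which an n-bit code stores an exponent in [0, 2^n - 1], so the smallest representable magnitude is base^(-(2^n - 1)), and it assumes that values below that level are effectively lost. The synthetic Softmax data, the sequence length of 197, and the helper names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def smallest_level(base, bits):
    """Smallest magnitude an n-bit log quantizer can represent (unit scale)."""
    return float(base) ** -(2 ** bits - 1)

def fraction_below(x, base, bits):
    """Share of values that fall under the smallest representable level."""
    return float(np.mean(x < smallest_level(base, bits)))

# Synthetic post-Softmax activations (assumption: random logits, 197 tokens).
rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(64, 197))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

for bits in (4, 3):
    for name, base in (("sqrt(2)", np.sqrt(2.0)), ("2", 2.0)):
        print(f"{bits}-bit, base {name}: "
              f"smallest level = {smallest_level(base, bits):.3g}, "
              f"fraction below = {fraction_below(probs, base, bits):.2f}")
```

At 3 bits the smallest level is 2^(-3.5) ≈ 0.088 for base √2 versus 2^(-7) ≈ 0.0078 for base 2. At 4 bits the same levels are 2^(-7.5) ≈ 0.0055 and 2^(-15) ≈ 0.00003. Under these assumptions, base √2 leaves little room for small Softmax values at 3 bits, which is consistent with the degradation the caption reports. Because the suitable base depends on both the bit width and the value range of a layer, this also fits the argument for adaptive bases in Figure 4.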