P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer

Huihong Shi, Xin Cheng, Wendong Mao, Zhongfeng Wang

TL;DR

This paper tackles the memory and compute bottlenecks of Vision Transformers (ViTs) by introducing P$^2$-ViT, a Power-of-Two (PoT) post-training quantization and acceleration framework. On the algorithm side, it replaces floating-point scaling factors with PoT scaling factors via adaptive PoT rounding and PoT-aware smoothing, and adds a coarse-to-fine automatic mixed-precision strategy to optimize accuracy under model-size constraints. On the hardware side, it proposes a chunk-based accelerator with a tailored row-stationary dataflow that raises throughput and minimizes re-quantization overhead, enabling end-to-end fully quantized ViTs. Empirically, P$^2$-ViT achieves comparable or superior quantization accuracy to counterparts that use floating-point scaling factors, delivers up to 10.1× speedup and 36.8× energy saving over GPU Turing Tensor Cores, and reaches up to 1.84× higher computation utilization efficiency than SOTA quantization-based ViT accelerators.

Abstract

Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield non-negligible re-quantization overhead, limiting ViTs' hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the re-quantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency trade-offs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored sub-processors to individually handle ViTs' different types of operations, alleviating reconfiguration overhead. Additionally, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P$^2$-ViT's effectiveness. In particular, we offer comparable or even superior quantization performance with PoT scaling factors when compared to the counterpart with floating-point scaling factors. Moreover, we achieve up to 10.1× speedup and 36.8× energy saving over GPU's Turing Tensor Cores, and up to 1.84× higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Code is available at https://github.com/shihuihong214/P2-ViT.
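
To make the re-quantization argument concrete, the following minimal Python sketch contrasts re-quantizing an integer accumulator with floating-point scaling factors against doing so with PoT scaling factors, where the multiplication collapses into a bit shift. The function names (`round_scale_to_pot`, `requantize_fp`, `requantize_pot`) are hypothetical, and the nearest-power-of-two rounding used here is a plain stand-in for illustration, not the paper's adaptive PoT rounding.

```python
import math


def round_scale_to_pot(scale: float) -> int:
    """Return exponent k so that 2**k is the power of two nearest to `scale`
    in the log domain. Assumption: simple nearest rounding, not the paper's
    adaptive PoT rounding."""
    return round(math.log2(scale))


def requantize_fp(acc: int, s_x: float, s_w: float, s_y: float) -> int:
    """Re-quantize an integer accumulator with floating-point scaling factors.
    This needs a floating-point (or fixed-point) multiplier in hardware."""
    return round(acc * s_x * s_w / s_y)


def requantize_pot(acc: int, k_x: int, k_w: int, k_y: int) -> int:
    """Re-quantize with PoT scaling factors s = 2**k.
    The multiplier 2**(k_x + k_w - k_y) reduces to a single shift."""
    shift = k_y - (k_x + k_w)
    if shift <= 0:
        return acc << -shift
    # Round to nearest by adding half an output step before the arithmetic
    # right shift (exact ties round toward +infinity).
    return (acc + (1 << (shift - 1))) >> shift


if __name__ == "__main__":
    k_x, k_w, k_y = -4, -6, -3          # scales 1/16, 1/64, 1/8 (illustrative)
    acc = 1000                           # integer MatMul accumulator
    print(requantize_fp(acc, 2.0 ** k_x, 2.0 ** k_w, 2.0 ** k_y))
    print(requantize_pot(acc, k_x, k_w, k_y))
```

Away from exact rounding ties, both functions return the same integer whenever the scaling factors are exact powers of two; the shift-based version simply avoids the multiplier.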

Paper Structure (20 sections, 22 equations, 10 figures, 11 tables, 1 algorithm)

Figures (10)

  • Figure 1: The overview of our P$^2$-ViT algorithm and hardware co-design framework. Specifically, (a) P$^2$-ViT's algorithm integrates a dedicated quantization scheme to obtain fully quantized ViTs with Power-of-Two (PoT) scaling factors, and further comprises coarse-to-fine automatic mixed-precision quantization to achieve better accuracy-efficiency trade-offs. (b) Furthermore, P$^2$-ViT's dedicated accelerator advocates a chunk-based design incorporating a tailored row-stationary dataflow to boost hardware efficiency.
  • Figure 2: The re-quantization processes with (a) vanilla floating-point (FP) scaling factors and (b) our proposed Power-of-Two (PoT) scaling factors.
  • Figure 3: The illustration of standard Vision Transformers' (ViTs') architecture (e.g., ViT and DeiT) that consists of multiple Transformer blocks. Each block includes a Multi-head Self-Attention module (MSA) and a Multi-Layer Perceptron (MLP). 'MatMul.' abbreviates matrix multiplication.
  • Figure 4: Illustrating the (a) layer-wise, (b) group-wise, and (c) channel-wise quantization for activations, and (d) feature-wise quantization for weights.
  • Figure 5: The minimum and maximum values of the last (a) LN's input, (b) LN's output, (c) smoothed LN's output, and (d) residual path's output, along the channel dimension in the full-precision ViT-Base, where inputs are randomly sampled from ImageNet.
  • ...and 5 more figures
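
The activation quantization granularities named in the Figure 4 caption differ only in how many values share one scaling factor. The sketch below illustrates this for a small tokens × channels activation matrix. It assumes symmetric max-abs calibration at 8 bits, which is an illustrative choice rather than the paper's calibration procedure, and all function names are hypothetical.

```python
def symmetric_scale(values, bits=8):
    """Scaling factor mapping max|v| onto the largest signed integer level.
    Assumption: symmetric max-abs calibration (all-zero input is not handled)."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def group_wise_scales(x, group_size, bits=8):
    """One scaling factor per group of `group_size` adjacent channels.
    `x` is a list of rows (tokens), each a list of channel values."""
    channels = len(x[0])
    scales = []
    for start in range(0, channels, group_size):
        stop = min(start + group_size, channels)
        vals = [row[c] for row in x for c in range(start, stop)]
        scales.append(symmetric_scale(vals, bits))
    return scales


def layer_wise_scales(x, bits=8):
    """A single scaling factor shared by the whole activation tensor."""
    return group_wise_scales(x, len(x[0]), bits)


def channel_wise_scales(x, bits=8):
    """One scaling factor per channel (a group size of 1)."""
    return group_wise_scales(x, 1, bits)


if __name__ == "__main__":
    # Two tokens, four channels; channel 1 is an outlier channel.
    x = [[0.5, -127.0, 1.0, 2.0],
         [-0.25, 63.5, -4.0, 1.0]]
    print(layer_wise_scales(x))
    print(group_wise_scales(x, group_size=2))
    print(channel_wise_scales(x))
```

With a single outlier channel, the layer-wise scale is dictated by that channel, so small-magnitude channels are represented coarsely; finer granularities avoid this at the cost of storing more scaling factors.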