Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey

Dayou Du, Gu Gong, Xiaowen Chu

TL;DR

This survey analyzes Vision Transformers through the lens of quantization and hardware acceleration, detailing architectural bottlenecks, quantization fundamentals, and ViT-specific quantization techniques. It covers PTQ, QAT, and DFQ, with a strong emphasis on activation quantization, calibration strategies, and gradient-based training methods, plus numerous hardware-oriented approaches to accelerate quantized ViTs. Key contributions include a comparative view of activation quantizers for post-Softmax, LayerNorm, and GELU, diverse calibration and optimization strategies, and multiple hardware designs that address non-linear operation support and data movement. The work highlights practical directions for sub-8-bit ViT deployment, data-free calibration, pruning-quantization synergy, and the need for robustness and generalization across tasks and datasets, supported by open-source resources on ViT quantization and acceleration.

Abstract

Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a promising alternative to convolutional neural networks (CNNs) in several vision-related applications. However, their large model sizes and high computational and memory demands hinder deployment, especially on resource-constrained devices. This underscores the need for algorithm-hardware co-design specific to ViTs, which tailors the algorithmic structure and the underlying hardware accelerator to each other's strengths. Model quantization converts high-precision numbers to lower-precision ones, reducing the computational and memory demands of ViTs and enabling hardware optimized specifically for the quantized algorithms, which further boosts efficiency. This article provides a comprehensive survey of ViT quantization and its hardware acceleration. We first examine the unique architectural attributes of ViTs and their runtime characteristics. We then review the fundamental principles of model quantization, followed by a comparative analysis of state-of-the-art quantization techniques for ViTs. We also explore the hardware acceleration of quantized ViTs, highlighting the importance of hardware-friendly algorithm design. Finally, we discuss ongoing challenges and future research directions. We maintain the related open-source materials at https://github.com/DD-DuDa/awesome-vit-quantization-acceleration.
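
The abstract describes quantization as converting high-precision numbers to lower-precision ones. As a rough illustration only, the sketch below shows one common form of this idea: symmetric, per-tensor uniform quantization of a list of floats to signed integers. The function names `quantize_symmetric` and `dequantize` and the sample values are assumptions made for this example; they are not taken from the survey, which covers many more elaborate schemes.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [x * scale for x in q]


# Illustrative values only (hypothetical weights).
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_symmetric(weights, num_bits=8)
recovered = dequantize(q, scale)

# The rounding error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, recovered))
print(q, scale, max_error)
```

Storing `q` as 8-bit integers plus a single float `scale` takes roughly a quarter of the memory of 32-bit floats, which is the kind of saving the abstract refers to. Lower bit widths shrink storage further at the cost of a larger step size and therefore a larger rounding error.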

Paper Structure

This paper contains 38 sections, 15 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overview diagram for the survey on effective low-bit ViT inference.
  • Figure 2: Architecture of the Vision Transformer (ViT). The left panel illustrates image division and positional embedding, while the right panel shows the standard encoder architecture with its various operations, as detailed in the original ViT paper [vit]. The abbreviation 'BMM' refers to batch matrix multiplication.
  • Figure 3: The Roofline model for the NVIDIA RTX 4090 GPU, with computations done in FP16 and INT8.
  • Figure 4: GFLOPs, GMOPs, and arithmetic intensity for ViTs of various sizes with different input image sizes.
  • Figure 5: Distribution of the last module's post-Softmax, post-GELU, and post-LayerNorm activations in DeiT-Base.
  • ...and 6 more figures
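
Figures 3 and 4 refer to the Roofline model and arithmetic intensity. As a hedged sketch of how these quantities relate, the code below computes the arithmetic intensity of a matrix multiplication (operations per byte moved) and the attainable throughput `min(peak, bandwidth × intensity)`. The device numbers `PEAK_OPS` and `MEM_BANDWIDTH` are hypothetical placeholders, not RTX 4090 specifications, and the byte count assumes each operand and the output cross the memory interface exactly once, which ignores caching and tiling effects. The layer shapes (197 tokens, hidden size 768, head dimension 64) correspond to a ViT-Base-style configuration at 224×224 input and are used purely for illustration.

```python
def matmul_intensity(m, k, n, bytes_per_elem):
    """Arithmetic intensity (ops per byte) of an (m x k) @ (k x n) matmul."""
    ops = 2 * m * k * n                                  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C once
    return ops / bytes_moved


def attainable_ops(peak_ops, mem_bandwidth, intensity):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_ops, mem_bandwidth * intensity)


# Hypothetical device (assumed numbers, not taken from the survey).
PEAK_OPS = 100e12        # operations per second
MEM_BANDWIDTH = 1e12     # bytes per second
RIDGE = PEAK_OPS / MEM_BANDWIDTH  # intensity above which a kernel is compute-bound

# A linear projection: 197 tokens, 768 -> 768 features.
linear_fp16 = matmul_intensity(197, 768, 768, bytes_per_elem=2)
linear_int8 = matmul_intensity(197, 768, 768, bytes_per_elem=1)

# One attention head's Q @ K^T: (197 x 64) @ (64 x 197).
attn_fp16 = matmul_intensity(197, 64, 197, bytes_per_elem=2)

for name, ai in [("linear fp16", linear_fp16),
                 ("linear int8", linear_int8),
                 ("attention fp16", attn_fp16)]:
    bound = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(name, round(ai, 1), bound, attainable_ops(PEAK_OPS, MEM_BANDWIDTH, ai))
```

Under these assumptions, halving the bytes per element doubles the arithmetic intensity of the same matmul, which is one reason lower-precision formats can move a kernel toward the compute-bound region. On real hardware the peak throughput itself also differs between FP16 and INT8, which this simplified sketch does not model.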