Model Quantization and Hardware Acceleration for Vision Transformers: A Comprehensive Survey

Dayou Du; Gu Gong; Xiaowen Chu

TL;DR

This survey analyzes Vision Transformers through the lens of quantization and hardware acceleration, detailing architectural bottlenecks, quantization fundamentals, and ViT-specific quantization techniques. It covers PTQ, QAT, and DFQ, with a strong emphasis on activation quantization, calibration strategies, and gradient-based training methods, plus numerous hardware-oriented approaches to accelerate quantized ViTs. Key contributions include a comparative view of activation quantizers for post-Softmax, LayerNorm, and GELU, diverse calibration and optimization strategies, and multiple hardware designs that address non-linear operation support and data movement. The work highlights practical directions for sub-8-bit ViT deployment, data-free calibration, pruning-quantization synergy, and the need for robustness and generalization across tasks and datasets, supported by open-source resources on ViT quantization and acceleration.

Abstract

Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a promising alternative to convolutional neural networks (CNNs) in several vision-related applications. However, their large model sizes and high computational and memory demands hinder deployment, especially on resource-constrained devices. This underscores the necessity of algorithm-hardware co-design specific to ViTs, which optimizes performance by tailoring the algorithmic structure and the underlying hardware accelerator to each other's strengths. Model quantization converts high-precision numbers to lower-precision ones, reducing the computational and memory demands of ViTs and enabling hardware specifically optimized for the quantized algorithms, which further boosts efficiency. This article provides a comprehensive survey of ViT quantization and its hardware acceleration. We first delve into the unique architectural attributes of ViTs and their runtime characteristics. Subsequently, we examine the fundamental principles of model quantization, followed by a comparative analysis of the state-of-the-art quantization techniques for ViTs. Additionally, we explore the hardware acceleration of quantized ViTs, highlighting the importance of hardware-friendly algorithm design. Finally, we discuss ongoing challenges and future research paths. We maintain the related open-source materials at https://github.com/DD-DuDa/awesome-vit-quantization-acceleration.
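
As a rough illustration of what converting high-precision numbers to lower precision can look like, the following is a minimal sketch assuming symmetric, per-tensor uniform quantization to signed 8-bit integers. The function names (`quantize_symmetric`, `dequantize`) and the sample values are hypothetical and are not taken from the survey; real ViT quantizers discussed in the literature are considerably more elaborate.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor so we never divide by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [qi * scale for qi in q]


# Hypothetical sample "weights", chosen only for demonstration.
weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_symmetric(weights)
recovered = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, recovered))
```

In this sketch the rounding error of each value is bounded by half the scale, which is why a tensor with a few large outliers (and therefore a large scale) tends to lose more precision under a single per-tensor scale.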

Paper Structure (38 sections, 15 equations, 11 figures, 5 tables)

Introduction
Vision Transformers Model Architecture and Performance Analysis
Overview of Vision Transformer Architecture
Variant ViTs
Roofline Model Analysis
Operations Analysis
End-to-end Analysis
Fundamentals of Quantization
Linear Quantization
Symmetric and Asymmetric Quantization
Static and Dynamic Quantization
Quantization Granularity
Post Training Quantization
Quantization Aware Training
Data Free Quantization
...and 23 more sections

Figures (11)

Figure 1: Overview diagram for the survey on effective low-bit ViTs Inference.
Figure 2: Architecture of the Vision Transformer (ViT): The left panel illustrates the process of image division and positional embedding, while the right panel delineates the standard encoder architecture with its various operations, as detailed in the original ViT paper. The abbreviation 'BMM' refers to batch matrix multiplication.
Figure 3: The Roofline model for the NVIDIA RTX 4090 GPU, with computations done in FP16 and INT8.
Figure 4: This figure displays the GFLOPs, GMOPs, and Arithmetic Intensity for various sizes of ViTs with different image sizes.
Figure 5: Distribution of the last module's post-Softmax, post-GELU, and post-LayerNorm activations in DeiT-Base.
...and 6 more figures
