A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking

Lorenzo Papa, Paolo Russo, Irene Amerini, Luping Zhou

TL;DR

This survey addresses the efficiency bottleneck of Vision Transformers (ViTs) by organizing methods into four categories: compact architecture (CA) design, pruning, knowledge distillation (KD), and quantization. It introduces the Efficient Error Rate (EER) as a unified metric for comparing inference-time efficiency across models and datasets. The paper surveys state-of-the-art techniques, frames them mathematically, and benchmarks them on ImageNet-1K (with supplementary COCO and ADE20K results) to reveal Pareto-optimal trade-offs, such as Castling-MViTv2-T among CA models and DynamicViT variants among KD and pruning methods. It also discusses open challenges and future directions, emphasizing multi-strategy integration and hardware-aware benchmarks to make efficient ViTs applicable in practice.

Abstract

Vision Transformer (ViT) architectures are becoming increasingly popular and are widely employed to tackle computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism, outperforming earlier convolutional neural networks. However, ViT deployment and performance have grown steadily alongside model size, the number of trainable parameters, and the number of operations. Furthermore, the computational and memory cost of self-attention increases quadratically with the image resolution. Generally speaking, it is challenging to employ these architectures in real-world applications because of hardware and environmental restrictions, such as limited processing and computational capabilities. Therefore, this survey investigates the most efficient methodologies to ensure sub-optimal estimation performance. In more detail, four categories of efficiency strategies are analyzed: compact architecture, pruning, knowledge distillation, and quantization. Moreover, a new metric called Efficient Error Rate is introduced to normalize and compare the model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies, and finally analyzes their performance over different application scenarios. Toward the end of the paper, we also discuss open challenges and promising research directions.
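The quadratic growth mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `attention_cost`, the 16-pixel patch size, and the embedding dimension of 768 are assumptions chosen for illustration, and the class token is ignored. The sketch counts the tokens produced by a square image and the size of the resulting token-by-token attention matrix.

```python
def attention_cost(image_size: int, patch_size: int = 16, dim: int = 768) -> dict:
    """Rough per-layer cost of softmax dot-product self-attention.

    Hypothetical helper for illustration only. Assumes a square image,
    non-overlapping square patches, and no class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    # Number of tokens n: one token per patch.
    tokens = (image_size // patch_size) ** 2

    # The attention matrix holds one score per token pair, so n * n entries.
    attn_entries = tokens * tokens

    # Q @ K^T and attention @ V each need about n * n * dim multiply-adds.
    macs = 2 * tokens * tokens * dim

    return {"tokens": tokens, "attn_entries": attn_entries, "macs": macs}


if __name__ == "__main__":
    for size in (224, 448, 896):
        cost = attention_cost(size)
        print(size, cost["tokens"], cost["attn_entries"], cost["macs"])
```

Under these assumptions, doubling the image side length multiplies the token count by 4 and the attention matrix size by 16, which is why high-resolution inputs quickly become expensive.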

Paper Structure

This paper contains 23 sections, 28 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Graphical overview of the vanilla self-attention and the multi-head self-attention blocks. The $\mathcal{O}(n^2)$ in the softmax dot-product self-attention highlights the quadratic cost of each operation.
  • Figure 2: Graphical representation of the vanilla KD learning strategy. Please refer to Section \ref{sec:KD_background} for the notation used.
  • Figure 3: Graphical representation of a general quantization procedure; the floating-point 32-bit data is compressed based on the quantization function $\Psi_k (x, \Delta)$ to a k-bit representation.
  • Figure 4: Graphical representation of efficient ViT techniques and their optimization effects. The fundamental element on which ViTs are built is shown at the top, where the dashed orange block highlights the component on which each optimization technique mainly focuses. A graphical depiction of how the optimizations influence the block of interest is provided at the bottom of the image. Please refer to Section \ref{sec:general_background} for the reported variables and their description.
  • Figure 5: Graphical comparison of CA model Top-1 accuracies and EER values on the ImageNet-1K dataset. The area of bubbles corresponds to the number of trainable parameters (#Param). The reference #Param sizes (from 10M to 100M) are shown in gray in the bottom right corner.
  • ...and 2 more figures
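
The quantization procedure in the Figure 3 caption, which compresses 32-bit floating-point data to a k-bit representation through $\Psi_k (x, \Delta)$, can be sketched as follows. The exact form of $\Psi_k$ is not given in this excerpt, so the sketch assumes a common symmetric uniform quantizer with step size $\Delta$ and a signed k-bit integer range; the function names `quantize_uniform` and `dequantize` are hypothetical.

```python
def quantize_uniform(x: float, delta: float, k: int) -> int:
    """Map a real value to a signed k-bit integer code.

    Assumed form (not confirmed by the source): divide by the step size
    delta, round to the nearest integer, then clip to the k-bit range.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if k < 2:
        raise ValueError("k must be at least 2 for a signed range")

    qmin = -(2 ** (k - 1))      # e.g. -128 for k = 8
    qmax = 2 ** (k - 1) - 1     # e.g.  127 for k = 8

    # Python's round() sends exact .5 ties to the nearest even integer.
    q = round(x / delta)
    return max(qmin, min(qmax, q))


def dequantize(q: int, delta: float) -> float:
    """Recover the approximate real value from its integer code."""
    return q * delta


if __name__ == "__main__":
    delta = 0.1
    for value in (0.26, -0.51, 100.0):
        code = quantize_uniform(value, delta, k=8)
        print(value, code, dequantize(code, delta))
```

In this sketch, values inside the representable range incur a rounding error of at most half a step, while values outside it are clipped to the nearest representable code.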