Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies

Shaibal Saha, Lanyu Xu

TL;DR

Vision Transformers offer strong performance in computer vision but face prohibitive compute and memory demands on edge devices. This survey systematically catalogues ViT-focused model compression (pruning, knowledge distillation, quantization) and hardware-aware acceleration strategies, emphasizing SW-HW co-design and edge-specific software tools. It consolidates taxonomy, evaluation metrics, and performance trade-offs across pruning, KD, quantization, and acceleration, and outlines challenges such as sparsity handling, mixed precision, NAS-driven optimization, and real-world benchmarking. The work highlights that combining compression with hardware-aware acceleration enables practical edge ViT deployment across GPUs, CPUs, FPGAs, and ASICs, while pointing to NAS and automated edge-aware compression as key directions for future research.

Abstract

In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges this gap by providing a structured analysis of model compression techniques, software tools for edge inference, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.

Paper Structure

This paper contains 48 sections, 1 equation, 12 figures, 18 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) The prevalence of transformer-based models in computer vision has led to a substantial increase in research publications. (b) Given their high computational complexity, model compression techniques are critical for reducing redundancy and improving efficiency. These advancements are essential for optimizing ViTs for hardware acceleration and real-world deployment on resource-constrained platforms [dimensions_data].
  • Figure 2: Core components and tools covered in this survey. The survey focuses on three main parts: model compression techniques (pruning, KD, and quantization), acceleration techniques (using FPGA, GPU, and ASIC), and associated software tools (toolkits, libraries, and inference engines), to provide a comprehensive understanding of efficient ViT deployment.
  • Figure 3: Static Pruning vs Dynamic Pruning
  • Figure 4: Overview of importance-based pruning techniques for ViTs. VTP [zhu2021vision] prunes dimensionality in the multi-head self-attention (MSA) and multi-layer perceptron (MLP) modules using importance scores. WDPruning [yu2022width] applies a binary mask ($N$) to the MSA module (a), followed by pruning of (b) attention heads and (c) linear projection channels. Patch Slimming [tang2022patch] removes redundant patches in a top-down, layer-wise manner.
  • Figure 5: Overview of token pruning based on SP-ViT [kong2022spvit].
  • ...and 7 more figures