ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, Yingyan Celine Lin

TL;DR

ViTCoD tackles the ViT inference bottleneck by combining a fixed sparse attention pruning strategy with a learnable auto-encoder (AE) and a dedicated accelerator. The split-and-conquer algorithm prunes and polarizes the attention maps into two regular workloads (denser and sparser), reducing attention computations while keeping accuracy high, and the AE module trades costly data movements for cheaper computations. The hardware design integrates denser/sparser engines and encoder/decoder units to maximize utilization and minimize off-chip traffic, achieving speedups of up to $235.3\times$ over CPUs and substantial gains over prior NLP-focused Transformer accelerators, with negligible accuracy loss at high sparsity. This co-design offers a practical path to efficient ViT inference on resource-constrained devices and may guide future sparse ViT hardware and algorithm developments.
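The split into denser and sparser workloads can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the fixed mask is obtained by thresholding an averaged attention map, and that a key column counts as "denser" when its number of kept entries reaches a hypothetical `dense_threshold`. The helper names `fixed_mask` and `polarize` and the 4x4 input are illustrative only.

```python
def fixed_mask(avg_attn, sparsity):
    """Keep roughly the largest (1 - sparsity) fraction of entries.

    Assumption: the mask is derived from one averaged attention map and
    then reused for every input. Ties at the threshold are all kept.
    """
    n = len(avg_attn)
    flat = sorted((v for row in avg_attn for v in row), reverse=True)
    keep = max(1, round((1.0 - sparsity) * n * n))
    threshold = flat[keep - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in avg_attn]


def polarize(mask, dense_threshold):
    """Move key columns with many kept entries to the front.

    Returns the reordered mask, the column order, and the number of
    columns assigned to the denser workload.
    """
    n = len(mask)
    col_nnz = [sum(mask[r][c] for r in range(n)) for c in range(n)]
    dense = [c for c in range(n) if col_nnz[c] >= dense_threshold]
    sparse = [c for c in range(n) if col_nnz[c] < dense_threshold]
    order = dense + sparse
    reordered = [[row[c] for c in order] for row in mask]
    return reordered, order, len(dense)


# Toy averaged attention map: token 2 is attended to by every query,
# and each token also attends strongly to itself.
avg_attn = [
    [0.50, 0.05, 0.40, 0.05],
    [0.05, 0.50, 0.40, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.40, 0.50],
]

mask = fixed_mask(avg_attn, sparsity=0.5625)  # keeps 7 of 16 entries
reordered, order, num_dense = polarize(mask, dense_threshold=3)

for row in reordered:
    print(row)
print("column order:", order, "| denser columns:", num_dense)
```

After reordering, the first column forms a fully dense block and the remaining columns hold only a few scattered entries, which is the kind of regularity that separate denser and sparser engines could exploit.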

Abstract

Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, whereas NLP Transformers need to handle input sequences with varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., $\geq 50\%$). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns, regularizing them into two levels of workloads without hurting the accuracy; this largely reduces the attention computations while leaving room for alleviating the remaining dominant data movements. On top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to $235.3\times$, $142.9\times$, $86.0\times$, $10.1\times$, and $6.8\times$ over general computing platforms (CPUs, EdgeGPUs, and GPUs) and prior-art Transformer accelerators (SpAtten and Sanger), respectively, under an attention sparsity of 90%.
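The trade between data movement and computation can be made concrete with a back-of-the-envelope count. This is a rough sketch under stated assumptions, not the paper's cost model: it assumes the auto-encoder is a pair of linear layers that mix the head dimension of Q and K, that values are stored as 8-bit numbers, and that the compression ratio is 50% (12 heads to 6). The shape used (197 tokens, 12 heads, head dimension 64) corresponds to one DeiT-Base layer; the function name `qk_traffic_and_macs` is hypothetical.

```python
def qk_traffic_and_macs(num_tokens, num_heads, head_dim,
                        compressed_heads, bytes_per_value=1):
    """Compare Q/K memory traffic with and without head-wise compression.

    Assumption: a linear encoder maps num_heads -> compressed_heads and a
    linear decoder maps back, applied at every (token, channel) position.
    """
    positions = num_tokens * head_dim

    # Bytes moved for Q and K (factor 2) without and with compression.
    baseline_bytes = 2 * positions * num_heads * bytes_per_value
    compressed_bytes = 2 * positions * compressed_heads * bytes_per_value

    # Extra multiply-accumulates: encoder + decoder (factor 2),
    # for both Q and K (factor 2).
    codec_macs = 2 * 2 * positions * num_heads * compressed_heads

    return baseline_bytes, compressed_bytes, codec_macs


baseline_bytes, compressed_bytes, codec_macs = qk_traffic_and_macs(
    num_tokens=197, num_heads=12, head_dim=64, compressed_heads=6
)

print("Q/K bytes without AE:", baseline_bytes)
print("Q/K bytes with AE:   ", compressed_bytes)
print("extra codec MACs:    ", codec_macs)
```

Under these assumptions the Q/K traffic halves while a few million extra MACs are added per layer. Whether that is a net win depends on the relative cost of an off-chip access versus a MAC on the target hardware, which this sketch does not model.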

Paper Structure (22 sections, 2 equations, 19 figures, 1 algorithm)

This paper contains 22 sections, 2 equations, 19 figures, 1 algorithm.

Figures (19)

  • Figure 1: Comparison between NLP Transformers and ViTs in terms of BLEU-sparsity or accuracy-sparsity trade-offs. Note that for NLP Transformers, we collect the results on a machine translation task, IWSLT EN $\rightarrow$ DE, following [treviso2021predicting]; for ViTs, we apply an info-based pruning technique to DeiT-Base/Small models on a classification task (e.g., ImageNet), following [kim2021rethinking].
  • Figure 2: Illustrating the fixed sparse attention mask.
  • Figure 4: The FLOPs (top) and measured latency (bottom) breakdowns of various ViTs on an EdgeGPU TX2 [edgegpu], where the self-attention (SA) module, denoted by the middle bars, accounts for over 50% of the total latency.
  • Figure 5: An overview of ViTCoD, the first algorithm-accelerator co-design framework dedicated to sparse ViTs.
  • Figure 6: Illustrating the self-attention workflow and its associated matrix multiplication patterns.
  • ...and 14 more figures