Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna

TL;DR

This work tackles the high computational burden of Vision Transformers by proposing a joint algorithm-hardware codesign for FPGA that combines static block weight pruning with dynamic token pruning. The algorithm introduces a simultaneous pruning training regime to recover accuracy, while the hardware design features a multi-level MPCA architecture and a dedicated token dropping module to handle irregular patterns. Empirical results on DeiT-Small show up to 3.4x compute reduction with about 3% accuracy loss and up to 1.6x model compression, with substantial latency improvements over CPU, GPU, and prior FPGA ViT accelerators. The approach demonstrates the practicality of deploying highly pruned ViTs on FPGA and offers a path toward automated, platform-specific code generation for future deployments.

Abstract

Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically reduces the computation based on the input. Combining these two techniques should significantly reduce computational complexity and model size; however, naively integrating them results in irregular computation patterns, leading to significant accuracy drops and difficulties in hardware acceleration. To address these challenges, we propose a comprehensive algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning, which combines static weight pruning and dynamic token pruning. For algorithm design, we systematically combine a hardware-aware structured block-pruning method for pruning model parameters and a dynamic token pruning method for removing unimportant token vectors. Moreover, we design a novel training algorithm to recover the model's accuracy. For hardware design, we develop a novel hardware accelerator for executing the pruned model. The proposed hardware design employs multi-level parallelism with a load balancing strategy to efficiently deal with the irregular computation patterns caused by the two pruning approaches. Moreover, we develop an efficient hardware mechanism for executing the on-the-fly token pruning.
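
To make the two pruning styles named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' method. The abstract does not state the pruning criteria, so everything specific here is an assumption chosen for illustration: the hypothetical helpers `block_prune` and `drop_tokens`, the L1-norm ranking of weight blocks, the externally supplied token scores, the keep ratios, and the choice to always retain the first (class) token.

```python
# Illustrative sketch only. Block size, keep ratios, the L1-norm block
# criterion, and the token scores are assumptions, not taken from the paper.

def block_prune(weight, block, keep_ratio):
    """Static pruning: zero out whole (block x block) tiles with the smallest L1 norm."""
    rows, cols = len(weight), len(weight[0])
    tiles = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            norm = sum(abs(weight[i][j])
                       for i in range(r, min(r + block, rows))
                       for j in range(c, min(c + block, cols)))
            tiles.append((norm, r, c))
    n_keep = max(1, int(keep_ratio * len(tiles)))
    kept = sorted(tiles, reverse=True)[:n_keep]
    # Start from an all-zero matrix and copy back only the surviving tiles.
    pruned = [[0.0] * cols for _ in range(rows)]
    for _, r, c in kept:
        for i in range(r, min(r + block, rows)):
            for j in range(c, min(c + block, cols)):
                pruned[i][j] = weight[i][j]
    return pruned


def drop_tokens(tokens, scores, keep_ratio):
    """Dynamic pruning: keep token 0 (assumed class token) plus the top-scoring others."""
    n_keep = max(1, int(keep_ratio * (len(tokens) - 1)))
    ranked = sorted(range(1, len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = [0] + sorted(ranked[:n_keep])  # preserve the original token order
    return [tokens[i] for i in keep]


weight = [
    [1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [0.2, 0.2, 2.0, 2.0],
    [0.2, 0.2, 2.0, 2.0],
]
pruned = block_prune(weight, block=2, keep_ratio=0.5)

tokens = ["cls", "a", "b", "c", "d"]
scores = [0.0, 0.9, 0.1, 0.7, 0.2]
kept_tokens = drop_tokens(tokens, scores, keep_ratio=0.5)
```

In this toy run the weight mask is fixed once and is the same for every input, whereas the set of surviving tokens depends on the per-input scores. That difference is why the abstract calls the first form static and the second dynamic, and why the combination leads to irregular computation patterns.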

Paper Structure (35 sections, 9 equations, 10 figures, 7 tables, 2 algorithms)

Figures (10)

  • Figure 1: Overview of the proposed algorithm-hardware codesign
  • Figure 2: Alternate pattern of block pruning for $\mathbf{W}_p$ and $\mathbf{W}_{\text{proj}}$ parameters.
  • Figure 3: Alternate column-wise/row-wise pruning for $\mathbf{W}_{\text{int}}$ and $\mathbf{W}_{\text{out}}$. Note that $D_{\text{mlp}}$ is much larger than $D$.
  • Figure 4: TDM inserted between the MSA and MLP blocks inside an encoder. TDM updates the input to the MLP block, $\mathbf{Z}'_{l}$, as $\mathbf{Z}'_{l} \leftarrow \text{TDM}(\mathbf{Z}'_{l})$.
  • Figure 5: Data layout of dense token matrix and sparse weight matrix.
  • ...and 5 more figures