Accelerating Vision Transformers on Brain Processing Unit
Jinchi Tang, Yan Guo
TL;DR
The paper tackles deploying Vision Transformers on Brain Processing Units by reworking DeiT so linear and LayerNorm operations are replaced with convolution-based equivalents, enabling full hardware acceleration without retraining. Through INT8 quantization with Horizon’s toolchain and a carefully constructed calibration workflow, the authors achieve up to $3.8\times$ speedups on the BPU while preserving most of the original accuracy on ImageNet (e.g., $80.4\%$ Top-1 for quantized DeiT-Base vs $81.8\%$ baseline). They validate on ImageNet and a Flower dataset, showing robust quantization resilience, especially in distilled variants, and demonstrate practical edge deployment potential. The work claims the first successful DeiT deployment on a BPU, highlighting significant implications for deploying ViTs on resource-constrained embedded platforms. Overall, the approach enables efficient, hardware-tailored ViT inference with minimal accuracy loss, expanding the applicability of Vision Transformers to edge devices.
Abstract
With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) has emerged as a dedicated platform for CNN acceleration, offering optimized INT8 computation for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play an increasingly crucial role in computer vision tasks. However, there is an architectural mismatch between CNN-optimized hardware and Vision Transformer computation: linear layers in Transformers operate on three-dimensional data, while BPU acceleration is designed for four-dimensional convolution operations. As a result, it is difficult or even impossible to leverage the BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the restructured models to inherit the original weight parameters without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a $3.8\times$ inference speedup. Our fine-tuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
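The weight-inheritance claim rests on a standard identity: a linear layer applied to each token is numerically the same as a 1x1 convolution over a four-dimensional tensor whose kernel is the reshaped linear weight. The NumPy sketch below illustrates that identity, along with one possible way to express layer normalization using 1x1 averaging convolutions. It is not the authors' implementation. The (B, C, 1, N) tensor layout, the helper names, the tensor sizes, and the LayerNorm decomposition are all assumptions made for illustration.

```python
import numpy as np

def linear(tokens, weight, bias):
    # Reference linear layer. tokens: (B, N, C_in), weight: (C_out, C_in).
    return tokens @ weight.T + bias

def conv1x1(x, kernel, bias):
    # 1x1 convolution. x: (B, C_in, H, W), kernel: (C_out, C_in, 1, 1).
    out = np.einsum("bchw,oc->bohw", x, kernel[:, :, 0, 0])
    return out + bias[None, :, None, None]

def tokens_to_nchw(tokens):
    # Assumed layout: (B, N, C) -> (B, C, 1, N), tokens along the width axis.
    return tokens.transpose(0, 2, 1)[:, :, None, :]

def nchw_to_tokens(x):
    # Inverse of tokens_to_nchw: (B, C, 1, N) -> (B, N, C).
    return x[:, :, 0, :].transpose(0, 2, 1)

def layer_norm(tokens, gamma, beta, eps=1e-6):
    # Reference LayerNorm over the channel (last) axis.
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    return (tokens - mean) / np.sqrt(var + eps) * gamma + beta

def layer_norm_conv(x, gamma, beta, eps=1e-6):
    # Hypothetical decomposition: channel mean and variance are computed
    # with a 1x1 averaging convolution whose weights are all 1 / C.
    c = x.shape[1]
    avg_kernel = np.full((1, c, 1, 1), 1.0 / c)
    zero_bias = np.zeros(1)
    mean = conv1x1(x, avg_kernel, zero_bias)
    centered = x - mean
    var = conv1x1(centered ** 2, avg_kernel, zero_bias)
    normed = centered / np.sqrt(var + eps)
    return normed * gamma[None, :, None, None] + beta[None, :, None, None]

# Illustrative sizes only, not taken from any DeiT configuration.
rng = np.random.default_rng(0)
B, N, C_in, C_out = 2, 5, 8, 16
tokens = rng.standard_normal((B, N, C_in))
weight = rng.standard_normal((C_out, C_in))
bias = rng.standard_normal(C_out)
gamma = rng.standard_normal(C_in)
beta = rng.standard_normal(C_in)

# The linear weight is reused unchanged: only its shape differs.
kernel = weight.reshape(C_out, C_in, 1, 1)

linear_ref = linear(tokens, weight, bias)
linear_conv = nchw_to_tokens(conv1x1(tokens_to_nchw(tokens), kernel, bias))

ln_ref = layer_norm(tokens, gamma, beta)
ln_conv = nchw_to_tokens(layer_norm_conv(tokens_to_nchw(tokens), gamma, beta))

linear_matches = bool(np.allclose(linear_ref, linear_conv))
layer_norm_matches = bool(np.allclose(ln_ref, ln_conv))
```

Because the kernel is a pure reshape of the linear weight, both paths agree up to floating-point rounding, which is why no retraining is required in principle. Any accuracy difference on real hardware would then come from INT8 quantization rather than from the restructuring itself.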
