Accelerating Vision Transformers on Brain Processing Unit

Jinchi Tang; Yan Guo

TL;DR

The paper tackles deploying Vision Transformers on Brain Processing Units by reworking DeiT so linear and LayerNorm operations are replaced with convolution-based equivalents, enabling full hardware acceleration without retraining. Through INT8 quantization with Horizon’s toolchain and a carefully constructed calibration workflow, the authors achieve up to $3.8\times$ speedups on the BPU while preserving most of the original accuracy on ImageNet (e.g., $80.4\%$ Top-1 for quantized DeiT-Base vs $81.8\%$ baseline). They validate on ImageNet and a Flower dataset, showing robust quantization resilience, especially in distilled variants, and demonstrate practical edge deployment potential. The work claims the first successful DeiT deployment on a BPU, highlighting significant implications for deploying ViTs on resource-constrained embedded platforms. Overall, the approach enables efficient, hardware-tailored ViT inference with minimal accuracy loss, expanding the applicability of Vision Transformers to edge devices.

Abstract

With the advancement of deep learning technologies, specialized neural processing hardware such as the Brain Processing Unit (BPU) has emerged as a dedicated platform for CNN acceleration, offering optimized INT8 computation for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play an increasingly crucial role in computer vision tasks. However, there is an architectural mismatch between CNN-optimized hardware and the computation pattern of Vision Transformers: linear layers in Transformers operate on three-dimensional data, while BPU acceleration is designed for four-dimensional convolution operations. As a result, it is difficult or even impossible to leverage the BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a $3.8\times$ inference speedup. Our fine-tuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.
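
The weight-inheritance claim in the abstract rests on the fact that a linear layer applied to each token is numerically the same as a $1 \times 1$ convolution applied at each spatial position. The following is a minimal, dependency-free sketch of that equivalence, not the authors' implementation. The tensor layout (tokens placed along the height axis of a `(C, N, 1)` feature map) and all function names are assumptions made for illustration.

```python
import random


def linear(tokens, weight, bias):
    """Linear layer on an (N, C_in) token matrix: y = x @ W^T + b."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def conv1x1(fmap, kernel, bias):
    """1x1 convolution on a (C_in, H, W) feature map.

    kernel has shape (C_out, C_in, 1, 1); each output pixel mixes only the
    channels found at the same spatial position.
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [
                sum(kernel[o][i][0][0] * fmap[i][y][x] for i in range(c_in)) + bias[o]
                for x in range(w)
            ]
            for y in range(h)
        ]
        for o in range(len(kernel))
    ]


def tokens_to_fmap(tokens):
    """(N, C) -> (C, N, 1): channels first, tokens along the height axis (assumed layout)."""
    return [[[tok[c]] for tok in tokens] for c in range(len(tokens[0]))]


def fmap_to_tokens(fmap):
    """(C, N, 1) -> (N, C)."""
    return [[fmap[c][n][0] for c in range(len(fmap))] for n in range(len(fmap[0]))]


random.seed(0)
n_tokens, c_in, c_out = 5, 4, 3
tokens = [[random.uniform(-1, 1) for _ in range(c_in)] for _ in range(n_tokens)]
weight = [[random.uniform(-1, 1) for _ in range(c_in)] for _ in range(c_out)]
bias = [random.uniform(-1, 1) for _ in range(c_out)]

# Weight inheritance: reshape (C_out, C_in) -> (C_out, C_in, 1, 1); no retraining involved.
kernel = [[[[w]] for w in row] for row in weight]

y_linear = linear(tokens, weight, bias)
y_conv = fmap_to_tokens(conv1x1(tokens_to_fmap(tokens), kernel, bias))

# Largest element-wise gap between the two formulations.
max_diff = max(
    abs(a - b) for row_a, row_b in zip(y_linear, y_conv) for a, b in zip(row_a, row_b)
)
print(max_diff)
```

Because the kernel is only a reshaped copy of the linear weight matrix, the two outputs agree up to floating-point rounding, which is consistent with the abstract's statement that the restructured model needs no retraining or fine-tuning.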

Paper Structure (13 sections, 4 figures, 4 tables)

Introduction
Related Work
Vision Transformers and DeiT
Model Quantization
Proposed BPU-Utilized DeiT Model
BPU-Optimized Operators
Transformer Blocks on the BPU Platform
Experiments
Implementation Details
Model Quantization and Calibration Dataset
Comparison with Original Model
Model Predictions on Annotation Errors
Conclusions

Figures (4)

Figure 1: Architecture of BPU-optimized LayerNorm implementation using $1 \times 1$ convolutions, illustrating the internal computational structure of LayerNorm on the BPU platform via convolution operations.
Figure 2: Comparison of attention mechanisms. (Left) The proposed BPU-optimized attention mechanism with hardware-friendly reformulation using convolution operations. (Right) The standard attention structure of the original implementation with matrix multiplications and softmax operations.
Figure 3: Comparison of Transformer block architectures. (Left) The BPU-optimized Transformer block with hardware-efficient reformulation for specialized acceleration. (Right) The standard Transformer block structure showing the conventional implementation with typical attention, MLP, LayerNorm, and residual connections.
Figure 4: Examples of misclassified images from the validation set. These samples demonstrate typical cases where the model made incorrect predictions, providing insights into the model's limitations and potential areas for improvement.
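
The Figure 1 caption describes LayerNorm built from $1 \times 1$ convolutions but does not spell out the construction. One plausible reading, sketched below under stated assumptions, is that the per-token mean and variance are computed by a $1 \times 1$ convolution whose weights are all $1/C$, so that every output channel holds the average over input channels. The function names, the flattened `(C, N)` layout, and the `eps` value are hypothetical and are not taken from the paper.

```python
import math


def conv1x1(fmap, kernel):
    """Bias-free 1x1 convolution; fmap is (C_in, N), kernel is (C_out, C_in)."""
    n = len(fmap[0])
    return [
        [sum(k * fmap[i][p] for i, k in enumerate(row)) for p in range(n)]
        for row in kernel
    ]


def layernorm_conv(fmap, gamma, beta, eps=1e-6):
    """LayerNorm over the channel axis using averaging 1x1 convolutions (assumed construction)."""
    c = len(fmap)
    # Every output channel is the mean over input channels at that position.
    avg_kernel = [[1.0 / c] * c for _ in range(c)]
    mean = conv1x1(fmap, avg_kernel)
    centered = [[x - m for x, m in zip(xr, mr)] for xr, mr in zip(fmap, mean)]
    squared = [[v * v for v in row] for row in centered]
    var = conv1x1(squared, avg_kernel)
    # Per-channel scale and shift (gamma, beta) applied after normalization.
    return [
        [g * v / math.sqrt(s + eps) + b for v, s in zip(cr, vr)]
        for cr, vr, g, b in zip(centered, var, gamma, beta)
    ]


def layernorm_reference(tokens, gamma, beta, eps=1e-6):
    """Textbook LayerNorm on an (N, C) token matrix, used only for comparison."""
    out = []
    for tok in tokens:
        mu = sum(tok) / len(tok)
        var = sum((x - mu) ** 2 for x in tok) / len(tok)
        out.append(
            [g * (x - mu) / math.sqrt(var + eps) + b for x, g, b in zip(tok, gamma, beta)]
        )
    return out


tokens = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.5, 3.0], [-2.0, 1.0, 1.0, 0.0]]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]

# (N, C) -> (C, N): channels first, as a convolution expects.
fmap = [[tok[c] for tok in tokens] for c in range(len(tokens[0]))]

out_conv = layernorm_conv(fmap, gamma, beta)
out_ref = layernorm_reference(tokens, gamma, beta)

# Compare position by position after transposing the conv output back to (N, C).
max_diff = max(
    abs(out_conv[c][n] - out_ref[n][c])
    for n in range(len(tokens))
    for c in range(len(gamma))
)
print(max_diff)
```

The two results agree up to floating-point rounding. The division by the standard deviation is left as an ordinary element-wise operation here; how the BPU realizes that step is not stated in the caption.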
