COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference

Ye Qiao, Zhiheng Chen, Yian Wang, Yifan Zhang, Yunzhe Deng, Sitao Huang

TL;DR

COBRA presents an algorithm-architecture co-optimized binary Transformer accelerator tailored for edge inference. It introduces a hardware-friendly binary attention mechanism, Shifted Polarized Softmax (SPS), and a Real 1-bit Binary Matrix Multiplication (RBMM) engine, integrated with memory-efficient data packing for edge FPGAs. The approach achieves up to 3.89k GOPS throughput and 448.7 GOPS/W on edge devices, delivering about 311× energy efficiency improvement over GPUs and a 3.5× throughput increase over prior binary accelerators with negligible accuracy loss. The work demonstrates practical, scalable edge deployment of binary Transformers and provides a reusable accelerator architecture suitable for resource-constrained platforms.

Abstract

Transformer-based models have demonstrated superior performance in various fields, including natural language processing and computer vision. However, their enormous model size and high demands in computation, memory, and communication limit their deployment to edge platforms for local, secure inference. Binary transformers offer a compact, low-complexity solution for edge deployment with reduced bandwidth needs and acceptable accuracy. However, existing binary transformers perform inefficiently on current hardware due to the lack of binary-specific optimizations. To address this, we introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs, delivering a 311x energy efficiency improvement over GPUs and a 3.5x throughput improvement over the state-of-the-art binary accelerator, with only negligible inference accuracy degradation.
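To make the idea of 1-bit matrix arithmetic concrete, the following is a minimal sketch of the generic bit-packed dot product that binary networks commonly rely on. It is not COBRA's RBMM datapath, whose details are not given on this page. The function names (`pack_bits`, `dot_pm1`, `dot_01_pm1`), the bit encoding (a set bit means +1 for weights and 1 for unsigned activations), and the 6-element vectors are all assumptions chosen for illustration.

```python
def pack_bits(values):
    """Pack a sequence into an int: bit i is set when values[i] == 1.

    Assumed encoding (hypothetical): for {-1, +1} vectors a set bit means +1,
    for {0, 1} vectors a set bit means 1.
    """
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def popcount(word):
    """Count the set bits of a non-negative int."""
    return bin(word).count("1")


def dot_pm1(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors via XNOR + popcount.

    Each agreeing position contributes +1 and each disagreeing one -1,
    so the sum is 2 * (number of agreements) - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR restricted to n bits
    return 2 * popcount(agree) - n


def dot_01_pm1(x_bits, w_bits, n):
    """Dot product of a {0, 1} vector x with a {-1, +1} vector w.

    Only positions where x is 1 contribute; among them, +1 weights add
    and -1 weights subtract: 2 * popcount(x & w) - popcount(x).
    """
    active = x_bits & ((1 << n) - 1)
    return 2 * popcount(active & w_bits) - popcount(active)


# Illustrative vectors (not taken from the paper).
a = [1, -1, 1, 1, -1, -1]
w = [1, 1, -1, 1, -1, 1]
x = [1, 0, 1, 1, 0, 0]

signed_result = dot_pm1(pack_bits(a), pack_bits(w), len(a))
mixed_result = dot_01_pm1(pack_bits(x), pack_bits(w), len(x))
```

Both helpers replace multiply-accumulate with bitwise logic and a population count, which is the general reason binary models map well onto LUT-based FPGA fabric. The mixed case shows one common way that a 0 value can appear alongside -1 and +1 without a second bit per operand.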

Paper Structure

This paper contains 29 sections, 13 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: General structure of binarized BERT with our Shifted Polarized Softmax (SPS) and Real Binary RBMM Engine
  • Figure 2: Shifted Polarized Softmax (SPS) Search
  • Figure 3: Similarity and Correlation Comparisons Between BiT (with Regular Softmax) and SPS Attention
  • Figure 4: A 6-bit toy example of our real binary dot product (RBVM).
  • Figure 5: COD-BT Hardware Architecture Overview.
  • ...and 2 more figures