BETA: Binarized Energy-Efficient Transformer Accelerator at the Edge

Yuhao Ji, Chao Fang, Zhongfeng Wang

TL;DR

This work targets edge deployment of binarized Transformers, which is hampered by inefficient quantized matrix multiplication (QMM) and by the energy overhead of multi-precision activations. It introduces a computation flow abstraction that reduces full-precision operations, and a binarized Transformer accelerator, BETA, with a configurable QMM engine and high parallelism enabled by unfolding in the dot product units (DPUs). Key contributions are the abstraction, which rewrites mixed-precision operations into cheaper components; a QMM engine that supports multiple activation precisions; and the ability to trade off efficiency against accuracy dynamically at the edge. On a ZCU102 FPGA, BETA achieves an average energy efficiency of 174 GOPS/W, 1.76x to 21.92x higher than prior FPGA-based accelerators. The results indicate that computation reordering combined with flexible hardware design makes edge Transformer acceleration practical.

Abstract

Existing binary Transformers are promising for edge deployment due to their compact model size, low computational complexity, and considerable inference accuracy. However, deploying binary Transformers on prior processors is challenging because of inefficient execution of quantized matrix multiplication (QMM) and the energy overhead caused by multi-precision activations. To tackle these challenges, we first develop a computation flow abstraction method for binary Transformers that improves QMM execution efficiency by optimizing the computation order. Furthermore, a binarized energy-efficient Transformer accelerator, namely BETA, is proposed to enable efficient deployment at the edge. Notably, BETA features a configurable QMM engine that accommodates the diverse activation precisions of binary Transformers and offers high parallelism and high speed for QMMs with impressive energy efficiency. Experimental results on a ZCU102 FPGA show that BETA achieves an average energy efficiency of 174 GOPS/W, which is 1.76x to 21.92x higher than prior FPGA-based accelerators, showing BETA's good potential for edge Transformer acceleration.
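
The idea of optimizing the computation order can be illustrated on the binary activation-times-weight product that the Figure 2 caption uses as its example, $(\alpha A + \gamma \cdot \mathbf{1}) \times \beta W$. The sketch below is not the authors' implementation. It assumes that $A$ and $W$ hold entries in {-1, +1} and that $\mathbf{1}$ is an all-ones matrix with the shape of $A$; the function names `naive_flow` and `reordered_flow` are hypothetical. It only checks numerically that the integer matrix product can be done first and the full-precision scaling applied afterwards.

```python
import numpy as np

def naive_flow(A, W, alpha, beta, gamma):
    """Scale first, then multiply: every multiply-accumulate is full precision."""
    act = alpha * A.astype(np.float64) + gamma      # alpha*A + gamma*1
    wgt = beta * W.astype(np.float64)               # beta*W
    return act @ wgt

def reordered_flow(A, W, alpha, beta, gamma):
    """Integer matmul first, full-precision scaling on the small result."""
    A_int = A.astype(np.int64)
    W_int = W.astype(np.int64)
    acc = A_int @ W_int                 # integer operations only
    col_sum = W_int.sum(axis=0)         # 1 x W reduces to the column sums of W
    # Full-precision work is now one scale-and-add per output element.
    return (alpha * beta) * acc + (gamma * beta) * col_sum

# Small demo with assumed {-1, +1} binary matrices and arbitrary coefficients.
rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(4, 8))
W = rng.choice([-1, 1], size=(8, 3))
assert np.allclose(naive_flow(A, W, 0.5, 0.25, 0.125),
                   reordered_flow(A, W, 0.5, 0.25, 0.125))
```

Both functions return the same matrix because $(\alpha A + \gamma \cdot \mathbf{1})\,\beta W = \alpha\beta\,(A W) + \gamma\beta\,(\mathbf{1} W)$. In the reordered form, the inner-dimension accumulation runs on integers, and full-precision arithmetic touches only the output-sized result.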

Paper Structure (12 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of MHA and FFN blocks in (a) vanilla Transformer and (b) binary Transformer, respectively.
  • Figure 2: An example of the binary activation$\times$weight operation $(\alpha A + \gamma \cdot \mathbf{1}) \times \beta W$ and its computation flow abstraction process, together with the corresponding computational complexity. The full-precision numbers $\alpha, \beta$ serve as coefficients, $\gamma$ serves as an offset, and $A, W$ are binary matrices. Op denotes a full-precision operation and Iop denotes an integer operation.
  • Figure 3: (a) Hardware architecture of BETA, where the orange arrows pass control signals and the black arrows transfer data. (b) Detailed structure of the dot product unit, which consists of the PE sequence and the compressor tree loop.
  • Figure 4: Operation modes of the configurable PE sequence, which combines data-packing and bit-serial processing to enable flexible configuration for different workloads. Note that a network with weights quantized to $b_w$ bits and activations quantized to $b_a$ bits is denoted as $Wb_wAb_a$BiT.
  • Figure 5: Tradeoff between hardware efficiency and model accuracy on BETA.
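
The Figure 4 caption says the PE sequence uses bit-serial processing to handle different activation precisions. As a rough illustration of the general bit-serial technique only, and not of BETA's PE sequence or its data-packing scheme, the sketch below multiplies multi-bit activations by binary weights one activation bit-plane at a time. It assumes unsigned activations in $[0, 2^{b_a})$ and weights in {-1, +1}; the function name `bit_serial_matmul` is hypothetical.

```python
import numpy as np

def bit_serial_matmul(X, W, act_bits):
    """Compute X @ W by processing one activation bit-plane per step.

    X: unsigned integer activations in [0, 2**act_bits) (assumed encoding).
    W: binary weights in {-1, +1} (assumed encoding).
    """
    X = np.asarray(X, dtype=np.int64)
    W = np.asarray(W, dtype=np.int64)
    acc = np.zeros((X.shape[0], W.shape[1]), dtype=np.int64)
    for i in range(act_bits):
        plane = (X >> i) & 1              # i-th bit of every activation, in {0, 1}
        acc += (plane @ W) * (1 << i)     # 1-bit partial product weighted by 2**i
    return acc

# Demo: 4-bit activations against binary weights.
rng = np.random.default_rng(0)
X = rng.integers(0, 16, size=(3, 8))
W = rng.choice([-1, 1], size=(8, 5))
assert np.array_equal(bit_serial_matmul(X, W, act_bits=4), X @ W)
```

The same loop body serves any activation width. Only the number of passes changes with `act_bits`, which is what lets a single 1-bit datapath be reused across precisions at the cost of extra passes.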