CAT: Customized Transformer Accelerator Framework on Versal ACAP
Wenbo Zhang, Yiqi Liu, Zhenshan Bao
TL;DR
The paper tackles the challenge of efficiently accelerating Transformer inference by balancing hardware customization against design space complexity. It introduces the CAT framework to derive a family of Transformer accelerators on Versal ACAP, using the AI Engine (AIE) as the compute core and programmable logic for memory-bound tasks. Through a top-down customization strategy, CAT maps Transformer workloads to EDPU primitives, optimizes AIE MM PU scale, parallel modes, and ATB parallelism, and employs CA-aware design and automated AIE code generation. Evaluations on BERT-Base and ViT-Base on the VCK5000 show throughput gains of up to 2.41x and 49.50x, and energy-efficiency gains of up to 7.80x and 6.19x, over an Nvidia A10G GPU and an AMD ZCU102 FPGA, respectively. The work advances ACAP-based Transformer acceleration by providing a general, deployable workflow that co-designs model characteristics with hardware resources to maximize performance and energy efficiency.
Abstract
Transformers were originally designed with the GPU as the target platform, but a GPU permits only limited hardware customization. An FPGA offers strong customization ability, but its design solution space is huge and the design difficulty is high. Versal ACAP is a heterogeneous computing architecture with the AI Engine at its core. It is far more flexible than a GPU in hardware customization, and its design solution space is smaller and more tractable than that of a traditional FPGA. This paper therefore proposes the Customized Transformer Accelerator Framework (CAT), from which a family of customized Transformer accelerators can be derived on Versal ACAP. The CAT framework is built on an abstract accelerator architecture that deconstructs the Transformer and maps it efficiently onto the hardware, exposing a variety of customizable properties. Through the framework's customization and optimization strategy, the underlying hardware and the upper-level model jointly constrain and determine these properties, yielding a customized accelerator. We use a 7 nm AMD Versal ACAP VCK5000 development board to implement CAT-based accelerators for different Transformer models. Experiments show peak throughput gains of 2.41x, 49.50x, and 1.32x over the 8 nm Nvidia A10G GPU, the 16 nm AMD ZCU102 FPGA, and the 7 nm AMD Versal ACAP VCK190 (SOTA), respectively. The corresponding peak energy-efficiency gains are 7.80x, 6.19x, and 1.15x.
