Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer
Richie Li, Sicheng Chen
TL;DR
This work tackles the bottleneck of Q, K, V projections in Transformer self-attention on edge FPGAs by engineering a tiled dense GEMM accelerator for DistilBERT on the Xilinx KV260. Using two-level tiling, on-chip data persistence, and a systolic-like compute engine, the design runs at 100 MHz and achieves up to $7\times$ speedup over CPU PyTorch and up to $214\times$ over naive NumPy for the core GEMMs, with a core throughput of $3.12$ GFLOPs for matrices sized $64 \times 768$ and $768 \times 3072$. The solution includes an HLS-based implementation, AXI/PYNQ integration, and validation on a quantized DistilBERT, all released as open source to enable reproducibility and further research. End-to-end DistilBERT performance improves by about $2\times$; data-transfer overhead through the PYNQ overlay limits the gain, which underscores the importance of memory hierarchy and system-level optimizations for future improvements. Overall, the work demonstrates that FPGA-based acceleration of critical Transformer operations is practical on edge devices and provides a blueprint for extending the design to larger models and to components such as softmax and the FFN.
Abstract
Transformer-based large language models (LLMs) rely heavily on matrix multiplications in their attention and feed-forward layers, and the Q, K, and V linear projections in the Multi-Head Self-Attention (MHA) module are a significant performance bottleneck. In this work, we present a tiled matrix multiplication accelerator for the resource-constrained Xilinx KV260 FPGA that targets this bottleneck. The design combines persistent on-chip storage, a two-level tiling strategy for data reuse, and a systolic-like unrolled compute engine. Integrated with DistilBERT for the Q, K, and V projections, the accelerator achieves up to a 7x speedup over an ARM CPU implementation (PyTorch) and roughly a 200x speedup over naive NumPy, reaching a throughput of up to 3.1 GFLOPs for (64,768) x (768,3072) matrix multiplications at a clock frequency of 100 MHz. These results indicate that FPGA-based acceleration is practical for critical Transformer operations and is a step toward scalable, energy-efficient deep learning inference on edge devices.
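To make the two-level tiling idea concrete, the following is a minimal software sketch in NumPy, not the HLS implementation described in this work. The function names `tiled_matmul` and `gemm_flops`, the tile sizes, and the loop order are all illustrative assumptions: the outer tiles stand in for blocks held in on-chip buffers, and the inner blocks stand in for the portion handled by the unrolled compute engine.

```python
import numpy as np


def tiled_matmul(a, b, outer=(32, 64, 64), inner=(8, 8)):
    """Two-level tiled GEMM sketch (hypothetical tile sizes).

    outer = (TM, TN, TK): tile sizes standing in for on-chip buffers.
    inner = (BM, BN): block computed per step, standing in for the
            unrolled, systolic-like compute engine.
    """
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    tm, tn, tk = outer
    bm, bn = inner
    c = np.zeros((m, n), dtype=np.float32)

    for i0 in range(0, m, tm):          # level 1: rows of output tiles
        for j0 in range(0, n, tn):      # level 1: columns of output tiles
            rows = min(tm, m - i0)
            cols = min(tn, n - j0)
            # The accumulator tile persists across all K tiles, so partial
            # sums are written back to the result only once per output tile.
            acc = np.zeros((rows, cols), dtype=np.float32)
            for k0 in range(0, k, tk):
                a_tile = a[i0:i0 + tm, k0:k0 + tk]
                b_tile = b[k0:k0 + tk, j0:j0 + tn]
                for i1 in range(0, rows, bm):      # level 2: inner blocks
                    for j1 in range(0, cols, bn):
                        acc[i1:i1 + bm, j1:j1 + bn] += (
                            a_tile[i1:i1 + bm, :] @ b_tile[:, j1:j1 + bn]
                        )
            c[i0:i0 + tm, j0:j0 + tn] = acc
    return c


def gemm_flops(m, k, n):
    """Floating-point operations in an (m, k) x (k, n) GEMM:
    one multiply and one add per inner-product term."""
    return 2 * m * k * n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 768)).astype(np.float32)
    b = rng.standard_normal((768, 3072)).astype(np.float32)
    c = tiled_matmul(a, b)
    print("max abs error vs NumPy:", float(np.abs(c - a @ b).max()))
    print("FLOPs for (64,768) x (768,3072):", gemm_flops(64, 768, 3072))
```

For the matrix sizes quoted above, `gemm_flops(64, 768, 3072)` gives 301,989,888 operations. If the reported 3.1 GFLOPs figure is read as operations per second, that would correspond to roughly 97 ms per multiplication, which is an inference from the stated numbers rather than a reported measurement.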
