Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml
Zhixing Jiang, Dennis Yin, Yihui Chen, Elham E Khoda, Scott Hauck, Shih-Chieh Hsu, Ekaterina Govorkova, Philip Harris, Vladimir Loncar, Eric A. Moreno
TL;DR
This work demonstrates a practical pathway for deploying transformer models on FPGAs using the hls4ml framework to achieve real-time, low-latency inference in physics applications. By automatically converting TensorFlow-built transformers into FPGA-friendly implementations and optimizing the multi-head attention (MHA), softmax, and layer normalization pipelines, the authors attain latency below 2 µs on a Xilinx VU13P FPGA using fixed-point quantization and careful reuse-based parallelization. The study benchmarks three distinct tasks (engine anomaly detection, b-tagging, and gravitational-wave classification), reporting competitive accuracy and AUC while detailing resource-latency trade-offs and memory architectures. These results show how hardware-accelerated transformers can serve high-throughput, data-intensive domains such as high-energy physics and gravitational-wave analysis, and they provide actionable guidance on quantization and memory design for FPGA deployments.
Abstract
This study presents an efficient implementation of transformer architectures on Field-Programmable Gate Arrays (FPGAs) using hls4ml. We describe our strategy for implementing the multi-head attention, softmax, and normalization layers, and evaluate three distinct models. Their deployment on a VU13P FPGA achieved latency below 2 µs, demonstrating the potential for real-time applications. The compatibility of hls4ml with any TensorFlow-built transformer model further enhances the scalability and applicability of this work.
Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO
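As a rough illustration of the fixed-point quantization mentioned in the TL;DR, the sketch below maps a floating-point value onto a signed fixed-point grid with a given total width and number of integer bits, in the spirit of an `ap_fixed<W, I>` type. The function name `quantize_fixed`, the round-to-nearest and saturating behavior, and the example bit widths are assumptions made for illustration; they are not taken from the paper and may differ from the settings the authors used.

```python
import math


def quantize_fixed(x: float, width: int = 16, integer: int = 6) -> float:
    """Quantize x to a signed fixed-point grid (hypothetical helper).

    width   : total number of bits, including the sign bit
    integer : number of integer bits, including the sign bit
    Rounding to nearest and saturation are illustrative choices.
    """
    frac_bits = width - integer
    scale = 1 << frac_bits              # codes per unit, 2**frac_bits
    lo = -(1 << (width - 1))            # most negative integer code
    hi = (1 << (width - 1)) - 1         # most positive integer code
    code = math.floor(x * scale + 0.5)  # round to the nearest code
    code = max(lo, min(hi, code))       # saturate instead of wrapping
    return code / scale


# An 8-bit type with 2 integer bits has a step of 1/64.
print(quantize_fixed(0.3, width=8, integer=2))     # 0.296875
print(quantize_fixed(100.0, width=8, integer=2))   # saturates at 1.984375
```

Fewer fractional bits reduce FPGA resource usage at the cost of a larger rounding error, which is the kind of trade-off a quantization study has to balance against accuracy.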
