Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml

Zhixing Jiang; Dennis Yin; Yihui Chen; Elham E Khoda; Scott Hauck; Shih-Chieh Hsu; Ekaterina Govorkova; Philip Harris; Vladimir Loncar; Eric A. Moreno

TL;DR

This work demonstrates a practical pathway for deploying transformer models on FPGAs with the hls4ml framework, targeting real-time, low-latency inference in physics applications. By automatically converting TensorFlow-built transformers into FPGA-friendly implementations and optimizing the multi-head attention (MHA), SoftMax, and Layer Normalization pipelines, the authors attain latency below 2 µs on a Xilinx Virtex UltraScale+ VU13P FPGA, using fixed-point quantization and careful reuse-based parallelization. The study benchmarks three distinct tasks (engine anomaly detection, B-tagging, and gravitational-wave classification), reporting competitive accuracy and AUC while detailing resource-latency trade-offs and memory architectures. These results highlight the practical impact of hardware-accelerated transformers for high-throughput, data-intensive domains such as high-energy physics and gravitational-wave analysis, and provide actionable guidance on quantization and memory design for FPGA deployments.
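To make the fixed-point quantization mentioned above concrete, here is a minimal sketch in Python of rounding a value onto a signed fixed-point grid. It is an illustration only, not the authors' implementation. The function name `quantize_fixed`, the choice of round-to-nearest with saturation, and the example bit widths are all assumptions; HLS fixed-point types can be configured with other rounding and overflow modes.

```python
import math


def quantize_fixed(x: float, total_bits: int, int_bits: int) -> float:
    """Round x onto a signed fixed-point grid (hypothetical helper).

    total_bits: full word width, including the sign bit.
    int_bits:   bits left of the binary point, including the sign bit.
    Assumes round-to-nearest and saturation on overflow.
    """
    frac_bits = total_bits - int_bits
    scale = 2 ** frac_bits                # number of grid steps per unit
    lo = -(2 ** (total_bits - 1))         # most negative integer code
    hi = 2 ** (total_bits - 1) - 1        # most positive integer code
    code = math.floor(x * scale + 0.5)    # round to the nearest grid step
    code = max(lo, min(hi, code))         # saturate instead of wrapping
    return code / scale


# An 8-bit word with 3 integer bits has a step of 2**-5 = 0.03125
# and covers the range [-4.0, 3.96875].
print(quantize_fixed(0.3, 8, 3))     # 0.3125
print(quantize_fixed(100.0, 8, 3))   # 3.96875 (saturated)
print(quantize_fixed(-100.0, 8, 3))  # -4.0 (saturated)
```

Widening the fractional part shrinks the rounding error (at most half a grid step for in-range values), while widening the integer part extends the representable range. Both cost additional FPGA resources, which is the trade-off the quantization study explores.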

Abstract

This study presents an efficient implementation of transformer architectures on Field-Programmable Gate Arrays (FPGAs) using hls4ml. We describe our strategy for implementing the multi-head attention, softmax, and layer normalization layers, and evaluate three distinct models. Deployed on a VU13P FPGA, the models achieve latency below 2 µs, demonstrating the potential for real-time applications. The compatibility of hls4ml with any TensorFlow-built transformer model further enhances the scalability and applicability of this work.

Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO

Paper Structure

This paper contains 16 sections, 10 equations, 14 figures, and 4 tables.

Introduction
Background
Detailed Description of Transformer Architecture
Related Work
Implementation Detail
Multi-Head Attention Layer
SoftMax Layer
Layer Normalization Layer
Benchmark Studies
Engine Anomaly Detection Model
B-Tagging Model
Gravitational Wave Model
Performance, Resource and Latency Estimation
Quantization
Parallelization
...and 1 more section

Figures (14)

Figure 1: The workflow of hls4ml [Duarte:2018ite]
Figure 2: The architecture of the transformer model [NIPS2017_3f5ee243]
Figure 3: One transformer block. The green layers are existing hls4ml functionality, while the blue are new in this paper.
Figure 4: The pipeline stages for the MHA layer
Figure 5: The data streaming structure between layers using FIFO memory
...and 9 more figures
