Ultra Fast Transformers on FPGAs for Particle Physics Experiments

Zhixing Jiang, Dennis Yin, Elham E Khoda, Vladimir Loncar, Ekaterina Govorkova, Eric Moreno, Philip Harris, Scott Hauck, Shih-Chieh Hsu

TL;DR

This work tackles the need for ultra-low-latency transformer inference in online LHC triggers by implementing a transformer with multi-head attention (MHA) on an FPGA via the hls4ml framework. It delivers a four-stage MHA pipeline with a lookup-table (LUT) based softmax and fixed-point quantization, integrated into hls4ml to enable real-time deployment in particle-physics detectors. Using a CMS jet-flavor tagging dataset, the study demonstrates latencies ranging from below one microsecond to a few microseconds on a Xilinx UltraScale+ device. The reuse factor and the fixed-point precision expose a clear trade-off between resource usage and latency, and accuracy close to that of floating point is reached with $10$ integer and $10$ fractional bits. The results indicate that transformer-based inference can feasibly be embedded in hardware triggers, offering a broadly applicable path for real-time, low-latency inference in high-energy physics and other scientific domains.
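To make the precision figures concrete, the following is a minimal NumPy sketch of what quantizing a value to $10$ integer and $10$ fractional bits means numerically. It is an illustration only, not the hls4ml implementation: the function name `quantize_fixed` is hypothetical, the integer width is assumed to include the sign bit, and round-to-nearest with saturation is an assumed overflow and rounding policy that the source does not specify.

```python
import numpy as np


def quantize_fixed(x, integer_bits=10, fractional_bits=10):
    """Map values onto a signed fixed-point grid (illustrative sketch).

    Assumptions (not taken from the source): integer_bits includes the
    sign bit, values are rounded to the nearest grid point, and values
    outside the representable range saturate instead of wrapping.
    """
    step = 2.0 ** -fractional_bits              # smallest representable increment
    lowest = -(2.0 ** (integer_bits - 1))       # most negative representable value
    highest = 2.0 ** (integer_bits - 1) - step  # most positive representable value
    snapped = np.round(np.asarray(x, dtype=np.float64) / step) * step
    return np.clip(snapped, lowest, highest)


if __name__ == "__main__":
    values = np.array([0.1234567, -3.75, 1000.0])
    print(quantize_fixed(values))
```

Under these assumptions the grid spacing is $2^{-10}$, so the rounding error of any in-range value is at most $2^{-11}$, while values beyond roughly $\pm 512$ are clipped. Lowering `fractional_bits` coarsens the grid, which is the knob varied in the precision scan.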

Abstract

This work introduces a highly efficient implementation of the transformer architecture on a Field-Programmable Gate Array (FPGA) using the hls4ml tool. Given the demonstrated effectiveness of transformer models in addressing a wide range of problems, their application in experimental triggers within particle physics is a subject of significant interest. In this work, we have implemented critical components of a transformer model, such as multi-head attention and softmax layers. To evaluate the effectiveness of our implementation, we have focused on a particle physics jet flavor tagging problem, employing a public dataset. We recorded a latency under 2 $\mu$s on the Xilinx UltraScale+ FPGA, which is compatible with hardware trigger requirements at the CERN Large Hadron Collider experiments.
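As an illustration of how a softmax layer can avoid computing exponentials at inference time, the sketch below replaces `exp` with a precomputed lookup table. It is a generic sketch of the technique, not the design described in the paper: the table size `TABLE_SIZE`, the input range `X_MIN`, and the function name `lut_softmax` are all hypothetical choices, and the final normalization uses an ordinary division where a hardware design might use a second table instead.

```python
import numpy as np

TABLE_SIZE = 1024  # hypothetical number of table entries
X_MIN = -8.0       # hypothetical lower bound of the tabulated input range

# Precomputed exp(x) for x evenly spaced in [X_MIN, 0].
EXP_TABLE = np.exp(np.linspace(X_MIN, 0.0, TABLE_SIZE))


def lut_softmax(scores):
    """Softmax along the last axis using a table lookup in place of exp."""
    scores = np.asarray(scores, dtype=np.float64)
    # Subtract the row maximum so every input to exp is <= 0.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    # Inputs below the tabulated range are clamped to the smallest entry.
    clipped = np.clip(shifted, X_MIN, 0.0)
    # Convert each input to the index of the nearest table entry.
    index = np.round((clipped - X_MIN) / -X_MIN * (TABLE_SIZE - 1)).astype(int)
    numerators = EXP_TABLE[index]
    return numerators / numerators.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    print(lut_softmax([1.0, 2.0, 3.0]))
```

Shifting by the maximum bounds the table's input range, so the table size and `X_MIN` alone set the approximation error. A larger table costs more memory but tracks the exact softmax more closely.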

Paper Structure (6 sections, 2 figures)

This paper contains 6 sections, 2 figures.

Figures (2)

  • Figure 1: The encoder block used for the transformer model is shown in (a). The full model architecture is shown in (b). The pipeline stages for the multi-head attention layer are shown in (c).
  • Figure 2: (a) Ratios of the fixed-point to floating-point AUCs as a function of the number of fractional bits. Five different values between 6 and 10 bits are chosen for the integer precision. Utilization of (b) DSPs and (c) lookup tables (LUTs) is shown as a function of the number of fractional bits, with the integer part fixed to 10 bits. Three different configurations with a reuse factor of 1 (blue), 2 (orange), or 4 (green) are shown. The target board (part number xcvu13p-fhga2104-2L-e) has a total of 12288 DSPs and 1.72 million LUTs.