LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics

Zhiqiang Que; Hongxiang Fan; Marcus Loo; He Li; Michaela Blott; Maurizio Pierini; Alexander Tapper; Wayne Luk

TL;DR

This paper tackles real-time GNN inference for collider triggers with LL-GNN, an FPGA-based accelerator for JEDI-net that achieves sub-microsecond latency ($<1\,\mu\mathrm{s}$) under strict LHC trigger constraints. The approach combines an outer-product based matrix-matrix multiplication (MMM), a column-major data layout, custom MMMs that exploit the sparsity of the structured adjacency matrix, and a fusion-driven dataflow that uses a finite-state machine (FSM) to handle imperfect loops, all under a two-level co-design framework. The FPGA design is up to 9.0× faster than a GPU implementation and up to 13.1× more power efficient, and it keeps latency below 1 μs even for larger models (e.g., JEDI-net-50p) with modest accuracy trade-offs, enabling online particle identification. The work also provides open-source templates for generating low-latency FPGA designs, with implications for next-generation collider triggers and potentially for other real-time GNN applications.

Abstract

This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors, delivering unprecedented low latency performance. Incorporating FPGA-based GNNs into particle detectors presents a unique challenge since it requires sub-microsecond latency to deploy the networks for online event selection with a data rate of hundreds of terabytes per second in the Level-1 triggers at the CERN Large Hadron Collider experiments. This paper proposes a novel outer-product based matrix multiplication approach, which is enhanced by exploiting the structured adjacency matrix and a column-major data layout. Moreover, a fusion step is introduced to further reduce the end-to-end design latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with a much better latency but also finds a high accuracy design under given latency constraints. To facilitate this, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times higher power efficiency than a GPU implementation. Compared to the previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy. The proposed LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
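The outer-product formulation mentioned in the abstract can be illustrated with a small software sketch. This is not the authors' hardware template; it is a minimal pure-Python illustration, with hypothetical function names, of why a column-major operand pairs naturally with an outer-product schedule: each step consumes one column of the left matrix and one row of the right matrix, so partial sums can be updated as soon as a column arrives.

```python
def matmul_outer_product(a_cols, b_rows):
    """Compute C = A @ B as a sum of rank-1 outer products.

    a_cols: A stored column-major (a list of K columns, each of length M).
    b_rows: B stored row-major (a list of K rows, each of length N).
    Both layouts are assumptions made for this illustration.
    """
    if len(a_cols) != len(b_rows):
        raise ValueError("inner dimensions do not match")
    m = len(a_cols[0])
    n = len(b_rows[0])
    c = [[0] * n for _ in range(m)]
    # Step k needs only column k of A and row k of B, so the accumulation
    # can start before the rest of A has been received.
    for a_col, b_row in zip(a_cols, b_rows):
        for i in range(m):
            for j in range(n):
                c[i][j] += a_col[i] * b_row[j]
    return c


def matmul_inner_product(a_rows, b_rows):
    """Reference row-times-column product, used only to check the result."""
    inner = len(b_rows)
    n = len(b_rows[0])
    return [
        [sum(a_row[k] * b_rows[k][j] for k in range(inner)) for j in range(n)]
        for a_row in a_rows
    ]


# A is 2x3, given row-major for the reference and column-major for streaming.
A_ROWS = [[1, 2, 3], [4, 5, 6]]
A_COLS = [list(col) for col in zip(*A_ROWS)]
B_ROWS = [[1, 0], [0, 1], [1, 1]]

C = matmul_outer_product(A_COLS, B_ROWS)
print(C)  # [[4, 5], [10, 11]]
```

In an inner-product schedule, by contrast, every output element needs a full row of A, which would have to be buffered when the data arrives column by column.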

Paper Structure (30 sections, 2 equations, 13 figures, 3 tables, 3 algorithms)

Introduction
Background
Graph Neural Network and Interaction Network
JEDI-net for Particle Identification
Design and Optimization
Outer-product Based Matrix Multiplication
Column-major Order
Custom MMMs for GNN Feature Transformation
A Dataflow Architecture with Task-level Parallelism
Divide, Conquer and Fuse
Handling Imperfect Loops
Implementation and co-design framework
Two-level Parallelism
Resource Model
Latency Model
...and 15 more sections

Figures (13)

Figure 1: (a) An example of interaction network-based GNN with edge block, aggregation, node block and MLP-head. (b) Overview of the JEDI-net architecture.
Figure 2: An example of an interaction-network based fully-connected graph with 4 nodes and the corresponding 12 uni-directional edges (left) with its receiving matrix $R_r$ as well as the sending matrix $R_s$ (right).
Figure 3: Outer-product based matrix multiplication with column-major order and structured sparsity.
Figure 4: Row-major (left) and column-major (right) orders.
Figure 5: The dataflow of JEDI-net.
...and 8 more figures
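The structure described in the Figure 2 caption can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not code from the paper: the edge ordering (grouped by receiver) and all function names are hypothetical. It builds $R_r$ and $R_s$ for a fully connected graph and shows that, because every column of these matrices is one-hot, multiplying a feature matrix by them reduces to copying columns.

```python
def build_relation_matrices(num_nodes):
    """Return (R_r, R_s, edges) for a fully connected directed graph.

    R_r and R_s are num_nodes x num_edges 0/1 matrices. The edge order
    (grouped by receiver) is an assumption made for this sketch.
    """
    edges = [(recv, send)
             for recv in range(num_nodes)
             for send in range(num_nodes) if send != recv]
    r_r = [[1 if recv == node else 0 for recv, _ in edges]
           for node in range(num_nodes)]
    r_s = [[1 if send == node else 0 for _, send in edges]
           for node in range(num_nodes)]
    return r_r, r_s, edges


def dense_matmul(a, b):
    """Plain matrix product, used only as a reference."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gather_columns(features, node_of_edge):
    """Compute features @ R without multiplications.

    Each column of R is one-hot, so output column e is simply a copy of
    input column node_of_edge[e].
    """
    return [[row[node] for node in node_of_edge] for row in features]


R_R, R_S, EDGES = build_relation_matrices(4)
FEATURES = [[1, 2, 3, 4], [5, 6, 7, 8]]  # 2 features per node, 4 nodes

B_R = gather_columns(FEATURES, [recv for recv, _ in EDGES])
B_S = gather_columns(FEATURES, [send for _, send in EDGES])
print(len(EDGES))  # 12
```

With 4 nodes this yields the 12 uni-directional edges named in the caption, and the gathered results match the dense products `dense_matmul(FEATURES, R_R)` and `dense_matmul(FEATURES, R_S)`.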
