Real-Time Stream Compaction for Sparse Machine Learning on FPGAs

Marc Neu, Isabel Haide, Torben Ferber, Jürgen Becker

TL;DR

This work proposes a concept for latency-optimized preprocessing of sparse sensor data that enables efficient GNN hardware acceleration by removing dynamic input sparsity, and develops a hierarchical sparsity compression pipeline optimized for FPGAs.

Abstract

Machine learning algorithms are being used more frequently in the first-level triggers of collider experiments, with Graph Neural Networks (GNNs) pushing the hardware requirements of FPGA-based triggers beyond the current state of the art. To meet the stringent demands of high-throughput and low-latency environments, we propose a concept for latency-optimized preprocessing of sparse sensor data, enabling efficient GNN hardware acceleration by removing dynamic input sparsity. Our approach rearranges data coming from a large number of First-In-First-Out (FIFO) interfaces, typically sensor frontends, to a smaller number of FIFO interfaces connected to a machine learning hardware accelerator. To achieve high throughput while minimizing hardware utilization, we developed a hierarchical sparsity compression pipeline optimized for FPGAs. We implemented our concept in the Chisel design language as an open-source hardware generator. For demonstration, we implemented one configuration of our module as a preprocessing stage in a GNN-based first-level trigger for the Electromagnetic Calorimeter inside the Belle II detector. Additionally, we evaluate latency, throughput, resource utilization, and scalability for a wide range of parameters to enable broader use in other large-scale scientific experiments.
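
The rearrangement described in the abstract can be illustrated with a small behavioral model in software. This is only a sketch under stated assumptions, not the authors' hierarchical pipeline or their Chisel generator: the names `compact_frame` and `distribute`, the use of zero as the "no data" marker, the (input index, value) record layout, and the round-robin assignment to outputs are all hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

# (input index, value); hypothetical record layout chosen for this sketch
Hit = Tuple[int, int]


def compact_frame(frame: Sequence[int]) -> List[Hit]:
    """Drop empty (zero) slots and keep the remaining hits in input order."""
    return [(idx, value) for idx, value in enumerate(frame) if value != 0]


def distribute(hits: Sequence[Hit], n_out: int) -> List[Deque[Hit]]:
    """Spread compacted hits round-robin over n_out output FIFOs (assumed policy)."""
    if n_out <= 0:
        raise ValueError("n_out must be positive")
    fifos: List[Deque[Hit]] = [deque() for _ in range(n_out)]
    for k, hit in enumerate(hits):
        fifos[k % n_out].append(hit)
    return fifos


# One frame from 5 input interfaces, of which only 2 carry data.
frame = [0, 7, 0, 0, 3]
outputs = distribute(compact_frame(frame), n_out=2)
```

In this model each hit keeps its original input index, so a downstream consumer could still tell which sensor channel produced it after the empty slots are removed. A hardware implementation would additionally have to bound latency and handle frames with more hits than the outputs can accept, which this sketch does not model.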

Paper Structure (5 sections, 1 equation, 6 figures)

Figures (6)

  • Figure 1: Histogram of input data density per event as received by the GNN-ETM module in the first-level trigger system at Belle II. The data density is defined as the fraction of non-zero data values out of all possible data values in an event.
  • Figure 2: Overview of the Graph Neural Network ECL Trigger Module inside the Belle II first-level trigger system. This diagram is simplified to give a conceptual overview of the sparsity compression module inside the system.
  • Figure 3: Typical application of the sparsity compression module, illustrated using the Belle II ECL as an example. Sparse data are provided by the detector frontend and compressed for subsequent processing in a dataflow accelerator.
  • Figure 4: Conceptual block-level diagram of our sparsity compression hardware module. In the exemplary case, the following values are chosen: $N_I = 5$, $N_O = 2$, and $D =5$.
  • Figure 5: Relative system resource utilization for various module configurations after out-of-context synthesis, placement, and routing with AMD Vivado 2024.2 for the AMD UltraScale XCVU190.
  • ...and 1 more figure
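
The data density defined in the caption of Figure 1 (the fraction of non-zero data values out of all possible data values in an event) can be written as a short function. This is a minimal sketch: the name `data_density` and the representation of an event as a flat sequence of integers are assumptions made for illustration, not part of the paper.

```python
from typing import Sequence


def data_density(event: Sequence[int]) -> float:
    """Fraction of non-zero values among all value slots of one event."""
    if len(event) == 0:
        raise ValueError("event must contain at least one value slot")
    non_zero = sum(1 for value in event if value != 0)
    return non_zero / len(event)


# Hypothetical event with 8 value slots, 2 of which carry data.
density = data_density([0, 0, 12, 0, 0, 0, 4, 0])
```

For the hypothetical event above, 2 of 8 slots are non-zero, giving a density of 0.25.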