Comparative Analysis of FPGA and GPU Performance for Machine Learning-Based Track Reconstruction at LHCb

Fotis I. Giasemis, Vladimir Lončar, Bertrand Granado, Vladimir Vava Gligorov

TL;DR

This paper addresses the need for high-throughput, low-latency ML inference in the LHCb first-level trigger by comparing FPGA and GPU implementations of an embedding MLP within the ETX4VELO track-reconstruction pipeline. Using HLS4ML, the authors implement the MLP on FPGA targets (PYNQ-Z2, Alveo U50, U250) and benchmark against a GPU baseline (RTX 3090 with TensorRT INT8), reporting throughput and energy metrics. The results show that FPGAs can achieve strong energy efficiency and competitive throughput, albeit with trade-offs in upfront cost and design effort, especially when scaling to larger FPGA cards. The work highlights practical considerations for deploying ML in real-time HEP pipelines and points to quantization-aware training and hardware-aware optimizations as avenues to further close the gap with GPU performance.

Abstract

In high-energy physics, the increasing luminosity and detector granularity at the Large Hadron Collider are driving the need for more efficient data processing solutions. Machine Learning has emerged as a promising tool for reconstructing charged particle tracks, due to its potentially linear computational scaling with detector hits. The recent implementation of a graph neural network-based track reconstruction pipeline in the first level trigger of the LHCb experiment on GPUs serves as a platform for comparative studies between computational architectures in the context of high-energy physics. This paper presents a novel comparison of the throughput of ML model inference between FPGAs and GPUs, focusing on the first step of the track reconstruction pipeline, an implementation of a multilayer perceptron. Using HLS4ML for FPGA deployment, we benchmark its performance against the GPU implementation and demonstrate the potential of FPGAs for high-throughput, low-latency inference without the need for expertise in FPGA development and while consuming significantly less power.

Paper Structure

This paper contains 13 sections, 2 tables.