Track reconstruction as a service for collider physics

Haoran Zhao; Yuan-Tang Chou; Yao Yao; Xiangyang Ju; Yongbin Feng; William Patrick McCormack; Miles Cochran-Branson; Jan-Frederik Schulte; Miaoyuan Liu; Javier Duarte; Philip Harris; Shih-Chieh Hsu; Kevin Pedro; Nhan Tran

TL;DR

The paper addresses the growing computational burden of charged-particle track reconstruction at the High-Luminosity LHC (HL-LHC) by proposing an inference-as-a-service framework that offloads tracking to GPUs via the NVIDIA Triton Inference Server. It evaluates two representative pipelines, Patatrack (rule-based) and Exa.TrkX (machine-learning-based), and shows improved GPU utilization and the ability to serve multiple CPU cores concurrently without increasing per-request latency. Key contributions include custom Triton backends for both pipelines, throughput and latency measurements, and integration with the ACTS framework to demonstrate end-to-end workflow performance. The results indicate substantial speedups and efficiency gains over CPU-only approaches, with potential reductions in GPU count and operational power, offering a scalable path for HL-LHC computing as pileup increases.

Abstract

Optimizing charged-particle track reconstruction algorithms is crucial for efficient event reconstruction in Large Hadron Collider (LHC) experiments due to their significant computational demands. Existing track reconstruction algorithms have been adapted to run on massively parallel coprocessors, such as graphics processing units (GPUs), to reduce processing time. Nevertheless, challenges remain in fully harnessing the computational capacity of coprocessors in a scalable and non-disruptive manner. This paper proposes an inference-as-a-service approach for particle tracking in high energy physics experiments. To evaluate the efficacy of this approach, two distinct tracking algorithms are tested: Patatrack, a rule-based algorithm, and Exa.TrkX, a machine learning-based algorithm. The as-a-service implementations show enhanced GPU utilization and can process requests from multiple CPU cores concurrently without increasing per-request latency. The impact of data transfer is minimal and insignificant compared to running on local coprocessors. This approach greatly improves the computational efficiency of charged-particle tracking, providing a solution to the computing challenges anticipated in the High-Luminosity LHC era.
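
The request flow described in the abstract (many CPU clients, one shared coprocessor server that groups requests) can be illustrated with a toy model. The sketch below is not the paper's implementation and does not use the Triton API: `ToyInferenceServer`, `infer_async`, `count_hits`, and `run_clients` are hypothetical names, the "model" is a placeholder function, and a Python thread stands in for the GPU server. It only shows the pattern of clients submitting requests without blocking while a single worker batches whatever is queued.

```python
import queue
import threading
from concurrent.futures import Future


class ToyInferenceServer:
    """Toy stand-in for an inference server: one worker drains a queue in batches."""

    def __init__(self, model, max_batch_size=4):
        self.model = model                    # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # hypothetical batching limit
        self.requests = queue.Queue()
        self.batch_sizes = []                 # recorded so the batching can be inspected
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def infer_async(self, payload):
        """Client-side call: enqueue a request and return a Future immediately."""
        future = Future()
        self.requests.put((payload, future))
        return future

    def _serve(self):
        while True:
            item = self.requests.get()        # block until at least one request arrives
            if item is None:                  # sentinel: stop serving
                return
            batch = [item]
            # Add whatever else is already waiting, up to the batch limit.
            while len(batch) < self.max_batch_size:
                try:
                    nxt = self.requests.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    self.requests.put(None)   # handle the sentinel after this batch
                    break
                batch.append(nxt)
            outputs = self.model([payload for payload, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), out in zip(batch, outputs):
                future.set_result(out)

    def shutdown(self):
        self.requests.put(None)
        self.worker.join()


def count_hits(batch):
    """Placeholder 'model': returns the number of hits in each event."""
    return [len(event) for event in batch]


def run_clients(n_clients=8):
    """Each client submits one event, then collects its own result."""
    server = ToyInferenceServer(count_hits)
    events = [[0.0] * (i + 1) for i in range(n_clients)]
    futures = [server.infer_async(event) for event in events]
    results = [f.result(timeout=5) for f in futures]
    server.shutdown()
    return results, server.batch_sizes
```

Because `infer_async` returns a future, a client could continue with other work and collect the result later. How the requests end up grouped into batches depends on thread timing, so only the total number of served requests is deterministic here.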

Paper Structure (19 sections, 8 figures, 1 table)

Introduction
Background
HEP Computing: online and offline reconstruction
Track reconstruction
Patatrack
Exa.TrkX pipeline
Inference as a service using NVIDIA Triton Inference Server
Custom backend
Model performance measurement
Patatrack as a service
Standalone algorithm throughput tests
HLT workflow throughput scanning
Exa.TrkX as a Service in ACTS
Exa.TrkX backend lifecycle
Standalone algorithm throughput tests
...and 4 more sections

Figures (8)

Figure 1: Inference-as-a-service approach. Users send inference requests from client CPUs; each request specifies the type of inference desired, the input dimensions and content, and the output dimensions and labels. The requests are delivered from the clients to the servers via gRPC, a high-performance remote procedure call (RPC) protocol. The server CPUs receive these tasks, batch them, execute inference on the appropriate coprocessor for each request, and return the output to the client CPUs via gRPC. In this approach, each server can contain a different number of coprocessors and provide different models. Each client can send tasks to multiple servers so that the tasks are processed in parallel. The client-to-server ratio can be scaled to match the demand of client requests.
Figure 2: Patatrack running on a GPU using the as-a-service approach. Common detector information, such as the location of each detector layer and the detector indices, is preloaded onto the GPU for later use. For each event, raw data and beam spot information are delivered to the GPU. The raw data are compressed before delivery; during the digitization step they are unpacked and converted back into "digis." Within a detector layer, neighboring digis are clustered to determine the location of a "hit," which represents a single interaction of a charged particle with that layer. Hits in adjacent layers are paired to form doublets, and doublets are then combined into triplets, which serve as track seeds. Starting from a seed, a group of hits that potentially form a track is selected, and a fit is applied to determine the track parameters and vertex location. Finally, the reconstructed tracks and vertices are delivered back to the host.
Figure 3: Illustration of the HLT running with tracking as a service. In the HLT workflow, the tracking algorithm inference runs on a server, while other modules in the workflow run on the client CPU in parallel to the tracking algorithms, ensuring no time is wasted waiting for track reconstruction results. This asynchronous pipeline has been implemented in production.
Figure 4: A scan of throughput improvement for the HLT workflow, varying the number of CPU clients communicating with a single GPU server. Direct inference on a GPU is limited to 64 CPUs requesting Patatrack inference. This results in about a 10% throughput gain compared to running it on the CPU alone (black dotted line). Using Patatrack as a service, the system can handle in excess of 120 CPU cores while maintaining the same level of throughput improvement, showing no GPU saturation.
Figure 5: A throughput saturation scan is performed by launching a server with ten model instances loaded on one NVIDIA Tesla T4 GPU. The remote Triton server loads the Patatrack model, receiving inference requests from multiple synchronized 4-thread CPU client jobs. The throughput is expected to stay at the same level before the GPU computing resources are fully saturated. The server becomes fully saturated when around 240 synchronized 4-thread CPU client jobs send requests simultaneously. The throughput starts to drop beyond the saturation point.
...and 3 more figures
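
The seeding step in the Figure 2 caption (hits in adjacent layers paired into doublets, doublets joined into triplets) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Patatrack's GPU implementation: each hit is reduced to a single azimuthal angle, compatibility is a single hypothetical cut `max_dphi`, angle wrap-around is ignored, and the function names `build_doublets` and `build_triplets` are invented for this example.

```python
from itertools import product


def build_doublets(hits_by_layer, max_dphi):
    """Pair hits in adjacent layers whose angles differ by at most max_dphi.

    hits_by_layer[k] is a list of azimuthal angles for the hits in layer k.
    Each hit is identified by a (layer, index) tuple.
    """
    doublets = []
    for layer in range(len(hits_by_layer) - 1):
        inner = hits_by_layer[layer]
        outer = hits_by_layer[layer + 1]
        for (i, phi_in), (j, phi_out) in product(enumerate(inner), enumerate(outer)):
            if abs(phi_out - phi_in) <= max_dphi:  # simplified cut, no wrap-around
                doublets.append(((layer, i), (layer + 1, j)))
    return doublets


def build_triplets(doublets):
    """Join two doublets that share their middle hit into a triplet seed."""
    outer_hits_by_inner = {}
    for inner, outer in doublets:
        outer_hits_by_inner.setdefault(inner, []).append(outer)
    triplets = []
    for inner, middle in doublets:
        # Any doublet starting at `middle` extends this doublet by one layer.
        for outer in outer_hits_by_inner.get(middle, []):
            triplets.append((inner, middle, outer))
    return triplets


# Three layers; one particle near phi = 0.1 crosses all of them.
example_hits = [[0.10, 2.00], [0.12, 2.05], [0.15, 1.00]]
example_doublets = build_doublets(example_hits, max_dphi=0.1)
example_triplets = build_triplets(example_doublets)
```

With these example hits, three doublets pass the cut, and only the hits near phi = 0.1 line up across all three layers, so a single triplet seed is formed. A realistic seeding step would apply further geometric cuts before the fit.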
