Scaling Graph Neural Networks for Particle Track Reconstruction

Alok Tripathy, Alina Lazar, Xiangyang Ju, Paolo Calafiura, Katherine Yelick, Aydin Buluc

TL;DR

This work tackles memory and generalization challenges in graph neural network–based particle track reconstruction by enabling minibatch training on vertex-subset subgraphs via ShaDow sampling and introducing matrix-based acceleration. It combines an Interaction Network GNN with subgraph minibatching and optimized distributed execution to scale to large event graphs without sacrificing accuracy. The approach yields higher precision and recall than the prior full-graph training regime and achieves 1.3×–2× speedups over PyG baselines, with notable gains in GPU utilization and reduced all-reduce latency. Collectively, these advances make scalable, detector-agnostic GNN-based track reconstruction more practical for high-energy physics workloads.

Abstract

Particle track reconstruction is an important problem in high-energy physics (HEP), necessary to study properties of subatomic particles. Traditional track reconstruction algorithms scale poorly with the number of particles within the accelerator. To alleviate this computational burden, the Exa.TrkX project introduces a pipeline that reduces particle track reconstruction to edge classification on a graph, and uses graph neural networks (GNNs) to produce particle tracks. However, this GNN-based approach is memory-prohibitive: training skips any graph that would exceed GPU memory. We introduce improvements to the Exa.TrkX pipeline that train on samples of input particle graphs, and show that these improvements yield higher precision and recall. In addition, we adapt performance optimizations, originally introduced for GNN training, to fit our augmented Exa.TrkX pipeline. These optimizations provide a $2\times$ speedup over our baseline implementation in PyTorch Geometric.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 1 table, 2 algorithms.

Figures (4)

  • Figure 1: Exa.TrkX GNN pipeline
  • Figure 2: Matrix-Based ShaDow Sampling Algorithm for the example graph and batch. When sampling multiple minibatches, we stack the $\mathbf{Q}^d$ matrices for each batch and pass the resulting stacked matrix to the bulk-sampling routine.
  • Figure 3: Epoch Time results for the Exa.TrkX pipeline across GPUs with the PyG implementation of ShaDow, and with our implementation of ShaDow and all-reduce optimization. Here, $k$ is the number of minibatches that were sampled in bulk during a single step of training. PyG timed out on $p=4$ processes for CTD training, so results could not be collected.
  • Figure 4: Convergence results on Ex3 for full-graph training, ShaDow training with PyG's implementation, and ShaDow training with our implementation. Precision and recall are based on the number of correctly classified edges across validation set particle graphs and the total number of edges.
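
The edge-level precision and recall mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each validation graph provides one score and one binary truth label per edge, and that an edge counts as predicted positive when its score reaches a threshold of 0.5. The function name `edge_precision_recall`, the input layout, and the threshold are hypothetical choices for illustration, not the paper's implementation.

```python
from typing import Iterable, Sequence, Tuple


def edge_precision_recall(
    graphs: Iterable[Tuple[Sequence[float], Sequence[int]]],
    threshold: float = 0.5,  # assumed cut on the edge score, not taken from the paper
) -> Tuple[float, float]:
    """Pool edge predictions over all graphs and return (precision, recall).

    Each element of `graphs` is a (scores, labels) pair with one entry per edge.
    """
    true_pos = false_pos = false_neg = 0
    for scores, labels in graphs:
        for score, label in zip(scores, labels):
            predicted = score >= threshold
            if predicted and label:
                true_pos += 1       # true edge kept
            elif predicted and not label:
                false_pos += 1      # fake edge kept
            elif not predicted and label:
                false_neg += 1      # true edge dropped
    # Guard against empty denominators (no predicted or no true edges).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Two toy "validation graphs" with per-edge scores and truth labels.
toy_graphs = [
    ([0.9, 0.2, 0.7], [1, 0, 0]),
    ([0.6, 0.1], [1, 1]),
]
precision, recall = edge_precision_recall(toy_graphs)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Counts are pooled across graphs before dividing, so larger graphs weigh more than smaller ones. Averaging per-graph metrics instead would be a different, equally plausible convention.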