Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques

A. Verdone, A. Devoto, C. Sebastiani, J. Carmignani, M. D'Onofrio, S. Giagu, S. Scardapane, M. Panella

TL;DR

This work tackles the challenge of analyzing massive LHC collision data by combining Graph Neural Networks with gradient-based data attribution. It introduces a pipeline in which a GNN is first trained on either the full dataset or a subset of it; influence scores (computed with TracIn) then identify training samples that positively or negatively affect predictions, the dataset is distilled by removing non-contributory samples, and the model is retrained on this reduced set. Empirical results on simulated SUSY ATLAS events show that influence-based data distillation can match, and often exceed, the performance of full-data or random-subset training while significantly reducing computational costs. The approach also improves explainability, since the discarded training samples can be inspected, and it can integrate other attribution techniques for broader applicability in data-intensive physics analyses.

Abstract

The experiments at the Large Hadron Collider at CERN generate vast amounts of complex data from high-energy particle collisions. This data presents significant challenges due to its volume and complex reconstruction, necessitating advanced analysis techniques. Recent advancements in deep learning, particularly Graph Neural Networks, have shown promising results in addressing these challenges but remain computationally expensive. The study presented in this paper uses a simulated particle collision dataset to integrate influence analysis into the graph classification pipeline, aiming to improve the accuracy and efficiency of collision event prediction tasks. After an initial training of a Graph Neural Network, we apply a gradient-based data influence method to identify influential training samples and then refine the dataset by removing non-contributory elements: the model trained on this reduced dataset can achieve good performance at a reduced computational cost. The method is completely agnostic to the specific influence method: different influence modalities can be easily integrated into our methodology. Moreover, by analyzing the discarded elements we can provide further insights about the event classification task. The novel integration of data attribution techniques with Graph Neural Networks in high-energy physics tasks can offer a robust solution for managing large-scale data problems, capturing critical patterns, and maximizing accuracy across several data-intensive domains.
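
The train, score, distill, retrain loop described above can be illustrated with a small sketch. It is not the authors' implementation: a NumPy logistic regression stands in for the GNN, the influence score follows the general checkpoint-based TracIn idea (learning rate times the dot product of per-sample loss gradients, summed over saved checkpoints), and the selection rule (keep the samples with the largest absolute score) is an assumption made for illustration. All function names, the synthetic data, and the reference set are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_grads(w, X, y):
    """Gradient of the binary cross-entropy loss w.r.t. w, one row per sample."""
    return (sigmoid(X @ w) - y)[:, None] * X


def train_with_checkpoints(X, y, lr=0.1, epochs=50, every=10):
    """Full-batch gradient descent; returns final weights and saved checkpoints."""
    w = np.zeros(X.shape[1])
    checkpoints = []
    for epoch in range(1, epochs + 1):
        w = w - lr * per_sample_grads(w, X, y).mean(axis=0)
        if epoch % every == 0:
            checkpoints.append(w.copy())
    return w, checkpoints


def tracin_scores(checkpoints, lr, X_train, y_train, X_ref, y_ref):
    """Checkpoint-based influence of each training sample on a reference set.

    For every checkpoint, add lr * <grad(train_i), grad(ref_j)>, summed over j.
    """
    scores = np.zeros(len(X_train))
    for w in checkpoints:
        g_train = per_sample_grads(w, X_train, y_train)
        g_ref = per_sample_grads(w, X_ref, y_ref)
        scores += lr * (g_train @ g_ref.T).sum(axis=1)
    return scores


def distill(scores, keep_fraction):
    """Assumed rule: keep the samples with the largest absolute influence."""
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    order = np.argsort(-np.abs(scores))
    return np.sort(order[:n_keep])


# Hypothetical synthetic data standing in for simulated collision events.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
X_train, y_train, X_ref, y_ref = X[:150], y[:150], X[150:], y[150:]

lr = 0.1
_, ckpts = train_with_checkpoints(X_train, y_train, lr=lr)        # initial training
scores = tracin_scores(ckpts, lr, X_train, y_train, X_ref, y_ref)  # influence values
keep = distill(scores, keep_fraction=0.5)                          # distilled subset
w_small, _ = train_with_checkpoints(X_train[keep], y_train[keep], lr=lr)  # retrain
```

Because the scoring function only needs per-sample gradients at saved checkpoints, another attribution method could replace `tracin_scores` without touching the rest of the loop, which is the sense in which the abstract calls the approach agnostic to the influence method.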

Paper Structure (14 sections, 2 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Collision event represented in a 2D plane with $\varphi$ and $\eta$ as axes. The $\varphi$-$\eta$ plane in an LHC experiment is a coordinate system used to describe the angular distribution of particles, where $\eta$ measures the particle's angle relative to the beam axis and $\varphi$ represents the azimuthal angle around the beam axis. Edges of the fully connected graph are not shown for clarity.
  • Figure 2: Our proposed methodology: we initially train the GNN on the original full-size dataset or a subset of it. Then, we use the saved checkpoints to compute influence values on the training data: samples with a higher score will be filtered out. We obtain a distilled dataset on which we perform the final training.
  • Figure 3: Complete graphs with kinematic features as nodes.
  • Figure 4: Features [ATLAS:2023act] exploited for each node.
  • Figure 5: AUROC score profile for varying percentages of the initial randomly selected dataset, using a threshold of 0.8 (a) and no threshold (b) on influence values.
  • ...and 1 more figures
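
The graph representation mentioned in the captions of Figures 1 and 3 can be sketched as follows. This is a minimal illustration, not the paper's code: it uses the standard definition of pseudorapidity, $\eta = -\ln\tan(\theta/2)$, with $\theta$ the polar angle to the beam axis, and builds the directed edge list of a fully connected graph. The function names and the example angles are hypothetical, and the real node features (Figure 4) are not reproduced here.

```python
import itertools
import math


def pseudorapidity(theta):
    """eta = -ln(tan(theta / 2)); theta is the polar angle to the beam axis in radians."""
    return -math.log(math.tan(theta / 2.0))


def complete_graph_edges(n_nodes):
    """Directed edges (i, j) with i != j, i.e. a fully connected graph without self-loops."""
    return list(itertools.permutations(range(n_nodes), 2))


# Hypothetical event: three reconstructed objects given as (theta, phi) in radians.
objects = [(math.pi / 2, 0.0), (math.pi / 3, 1.5), (2.0, -2.0)]

# Node coordinates in the eta-phi plane, plus the complete-graph connectivity.
nodes = [(pseudorapidity(theta), phi) for theta, phi in objects]
edges = complete_graph_edges(len(nodes))
```

A particle emitted perpendicular to the beam ($\theta = \pi/2$) has $\eta \approx 0$, and $|\eta|$ grows as the direction approaches the beam axis. A complete graph on $n$ nodes has $n(n-1)$ directed edges, so the edge count grows quadratically with the number of reconstructed objects.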