FAIR AI Models in High Energy Physics

Javier Duarte, Haoyang Li, Avik Roy, Ruike Zhu, E. A. Huerta, Daniel Diaz, Philip Harris, Raghav Kansal, Daniel S. Katz, Ishaan H. Kavoori, Volodymyr V. Kindratenko, Farouk Mokhtar, Mark S. Neubauer, Sang Eon Park, Melissa Quinnan, Roger Rusack, Zhizhen Zhao

TL;DR

This paper proposes operational FAIR (findable, accessible, interoperable, reusable) principles for AI models in high energy physics (HEP) and introduces cookiecutter4fair, a project template that automates the scaffolding of FAIR AI projects. It defines each of the four FAIR properties for models and emphasizes that the training data must also be FAIR, while acknowledging that backend optimizations pose challenges for reproducibility. The authors demonstrate a concrete FAIR implementation of an interaction network, a graph neural network that identifies Higgs→bb decays, including a DLHub deployment, ONNX/TensorRT portability across hardware and software stacks, and interpretability analyses using explainable AI (XAI) methods. The results show that the model is robust and portable across platforms, and the interpretability analyses give insight into feature importance and neuron activation, supporting the practical viability of a FAIR AI ecosystem for AI-driven discovery in HEP. Overall, this work lays a foundation for interoperable AI assets in HEP and advocates for standardized APIs and federated repositories to accelerate collaboration and reuse across disciplines.

Abstract

The findable, accessible, interoperable, and reusable (FAIR) data principles provide a framework for examining, evaluating, and improving how data is shared to facilitate scientific discovery. Generalizing these principles to research software and other digital products is an active area of research. Machine learning (ML) models -- algorithms that have been trained on data without being explicitly programmed -- and more generally, artificial intelligence (AI) models, are an important target for this because of the ever-increasing pace with which AI is transforming scientific domains, such as experimental high energy physics (HEP). In this paper, we propose a practical definition of FAIR principles for AI models in HEP and describe a template for the application of these principles. We demonstrate the template's use with an example AI model applied to HEP, in which a graph neural network is used to identify Higgs bosons decaying to two bottom quarks. We report on the robustness of this FAIR AI model, its portability across hardware architectures and software frameworks, and its interpretability.

Paper Structure (30 sections, 6 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Folder hierarchy of the cookiecutter4fair v1.0.0 project template. The main Python source code is contained in src. The docs folder contains a Sphinx project for generating documentation.
  • Figure 2: Illustration of a $\mathrm{H}\to\mathrm{b}\overline{\mathrm{b}}$ jet with two secondary vertices (SVs) from the decay of two bottom hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence have a large impact parameter (IP) value.
  • Figure 3: Network architecture and dataflow in the interaction network (IN) model [Moreno:2019neq]. The choice of model hyperparameters and input data dimensions for the baseline model is given in the accompanying table.
  • Figure 4: GPU utilization (shown as a blue line) and throughput (shown as box-and-whisker plots) as a function of batch size. GPU utilization saturates at 100% for a batch size of 1000, while throughput peaks at 35 k inferred events per second for a batch size of 1200. For the box-and-whisker throughput plots, ten runs are performed with a given batch size. The black line represents the median value of the throughput, the orange box represents the range from the first quartile to the third quartile, and the whiskers extend an additional distance of 1.5$\times$ the interquartile range. The white circles represent the outliers.
  • Figure 5: Change in AUC score with respect to the baseline model when each track and secondary vertex (SV) feature is individually masked during inference.
  • ...and 4 more figures
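
The feature-masking study summarized in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' code: the toy linear scorer `toy_predict`, the synthetic data, the feature names, and the choice to mask a feature by setting it to zero are all assumptions made for illustration. The AUC is computed with the rank-sum identity, which assumes no tied scores.

```python
import numpy as np

def auc_score(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

def masked_auc_changes(predict, x, y, feature_names):
    """Mask one feature at a time (set to zero, an assumed masking scheme)
    and return the baseline AUC plus the AUC change for each feature."""
    baseline = auc_score(y, predict(x))
    changes = {}
    for j, name in enumerate(feature_names):
        x_masked = x.copy()
        x_masked[:, j] = 0.0
        changes[name] = auc_score(y, predict(x_masked)) - baseline
    return baseline, changes

# Synthetic stand-in data: three hypothetical features of varying usefulness.
rng = np.random.default_rng(0)
n_events = 2000
y = rng.integers(0, 2, size=n_events)
x = rng.normal(size=(n_events, 3))
x[:, 0] += 2.0 * y  # strongly discriminating feature
x[:, 1] += 1.0 * y  # weakly discriminating feature
# column 2 is pure noise and gets zero weight in the toy model below

weights = np.array([1.0, 1.0, 0.0])

def toy_predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return features @ weights

names = ["feat_strong", "feat_weak", "feat_noise"]
baseline_auc, auc_changes = masked_auc_changes(toy_predict, x, y, names)
for name in names:
    print(f"{name}: change in AUC = {auc_changes[name]:+.4f}")
```

A large negative change marks a feature the model relies on, while a change near zero marks one it effectively ignores. With a real model, `toy_predict` would be replaced by the model's inference call and `x` by the actual track and SV feature arrays.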