Rapid event extraction and tensorial event adaption: Libraries for efficient access and generic reweighting of parton-level events and their implementation in the MadtRex module

Stefan Roiser, Robert Schöfbeck, Zenny Wettersten

TL;DR

The paper tackles the computational bottleneck of managing and reweighting parton-level events across large parameter spaces. It introduces Rex and teaRex, C++17 libraries that provide efficient LHE I/O, data-parallel event representations, and completely generic reweighting workflows, and that form the foundation of the MadtRex module. Reweighting on SIMD CPUs and SIMT GPUs achieves speedups of more than two orders of magnitude over the standard MG5aMC approach, and MadtRex remains roughly 40 times faster than MG5aMC reweighting even without explicit data parallelism. By bridging object-oriented LHE formats and structure-of-arrays data layouts, Rex and teaRex support flexible, model-agnostic reweighting (e.g., parameter, PDF, SMEFT) while keeping physics-driven data access and extensible interfaces, which makes large-scale phenomenology studies practical.

Abstract

We present Rex and teaRex, C++17 libraries for efficient management of parton-level hard scattering event information and completely generic reweighting of such events, respectively. Rex is primarily an interfacing and I/O library for Les Houches Event format files and provides an internal event format designed with data parallelism in mind, and teaRex extends this format to provide full parton-level reweighting functionality with minimal code needing to be written by the end user. These libraries serve as the foundation for the MadtRex reweighting module for MadGraph5_aMC@NLO, extending the functionality of the CUDACPP plugin to allow for data-parallel model-generic leading order parameter reweighting on SIMD-enabled CPUs and SIMT GPUs, speeding up reweighting by more than two orders of magnitude compared to MadGraph5_aMC@NLO running on the exact same hardware while providing trivial scalability to larger and distributed systems.
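To make the data-parallel event format and the reweighting step concrete, the following is a minimal sketch, not the Rex or teaRex API. All type and function names (`ParticleAoS`, `EventAoS`, `EventsSoA`, `to_soa`, `reweight`) are hypothetical, and the sketch assumes a fixed particle multiplicity per event and the standard leading-order reweighting relation $w_\text{new} = w_\text{old}\,|M_\text{new}|^2/|M_\text{old}|^2$, with the squared matrix elements supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-of-structures record, as obtained when an LHE file is
// parsed event by event. This is an illustration, not the Rex event type.
struct ParticleAoS {
    double px, py, pz, e;
};

struct EventAoS {
    double weight;
    std::vector<ParticleAoS> particles;
};

// Hypothetical structure-of-arrays layout for events of equal multiplicity.
// Momentum components are stored contiguously across events, so that
// index = particle * n_events + event.
struct EventsSoA {
    std::size_t n_events = 0;
    std::size_t n_particles = 0;
    std::vector<double> weight;         // size n_events
    std::vector<double> px, py, pz, e;  // size n_particles * n_events
};

// Transpose object-oriented events into the structure-of-arrays layout.
EventsSoA to_soa(const std::vector<EventAoS>& events) {
    EventsSoA soa;
    if (events.empty()) return soa;
    soa.n_events = events.size();
    soa.n_particles = events.front().particles.size();
    const std::size_t n = soa.n_events * soa.n_particles;
    soa.weight.resize(soa.n_events);
    soa.px.resize(n);
    soa.py.resize(n);
    soa.pz.resize(n);
    soa.e.resize(n);
    for (std::size_t i = 0; i < soa.n_events; ++i) {
        // Assumption of this sketch: every event has the same multiplicity.
        assert(events[i].particles.size() == soa.n_particles);
        soa.weight[i] = events[i].weight;
        for (std::size_t p = 0; p < soa.n_particles; ++p) {
            const std::size_t idx = p * soa.n_events + i;
            soa.px[idx] = events[i].particles[p].px;
            soa.py[idx] = events[i].particles[p].py;
            soa.pz[idx] = events[i].particles[p].pz;
            soa.e[idx] = events[i].particles[p].e;
        }
    }
    return soa;
}

// Leading-order reweighting: w_new = w_old * |M_new|^2 / |M_old|^2.
// The squared matrix elements are inputs; computing them is out of scope.
std::vector<double> reweight(const std::vector<double>& w_old,
                             const std::vector<double>& me2_old,
                             const std::vector<double>& me2_new) {
    assert(w_old.size() == me2_old.size());
    assert(w_old.size() == me2_new.size());
    std::vector<double> w_new(w_old.size());
    // Flat loop over contiguous arrays: a natural target for SIMD or GPU threads.
    for (std::size_t i = 0; i < w_old.size(); ++i) {
        w_new[i] = w_old[i] * me2_new[i] / me2_old[i];
    }
    return w_new;
}
```

In this layout each event's new weight depends only on that event's own array entries, so the loop has no cross-event dependencies. That independence is the property that lets such a workload map onto SIMD lanes or SIMT threads.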

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Read throughput for an electroweak LHE sample using Rex as a function of file size (in number of events) for optimisation levels -O0 through -O3, using compiler version 13.2.0. Each point gives the average throughput from 100 measurements, with standard deviations highlighted. To ensure only the library functionality itself was measured, the benchmark executable was compiled with no optimisations. The most significant speed-up comes from -O1 optimisation, although -O2 and particularly -O3 do provide additional load speed-up (compare with Figure 2, where -O3 has no significant speed-up compared to -O2). The dip at 100 000 events coincides with the corresponding LHE file exceeding the 16.5 MB L3 cache of the Intel Xeon Gold 5118 [Intel5118Specs].
  • Figure 2: Read throughput for a QCD LHE sample using Rex as a function of file size (in number of events) for optimisation levels -O0 through -O3, using compiler version 13.2.0. Each point gives the average throughput from 100 measurements, with standard deviations highlighted. To ensure only the library functionality itself was measured, the benchmark executable was compiled with no optimisations. The most significant speed-up comes from -O1 optimisation; -O2 provides additional load speed-up, while -O3 does not appear to have any significant impact (compare with Figure 1, where -O3 has a significant speed-up compared to -O2). Although the throughput dip is more gradual than in Figure 1, the plateau is once again reached at 100 000 events, which for these samples is also where the LHE file size exceeds the 16.5 MB L3 cache of the Intel Xeon Gold 5118 [Intel5118Specs].
  • Figure 3: Read throughput for an electroweak LHE sample using Rex as a function of file size (in number of events) for optimisation levels -O0 through -O3, using compiler version 13.2.0. Each point gives the average throughput from 100 measurements, with standard deviations highlighted. To ensure only the library functionality itself was measured, the benchmark executable was compiled with no optimisations. The most significant speed-up comes from -O1 optimisation.
  • Figure 4: Read throughput for a QCD LHE sample using Rex as a function of file size (in number of events) for optimisation levels -O0 through -O3, using compiler version 13.2.0. Each point gives the average throughput from 100 measurements, with standard deviations highlighted. To ensure only the library functionality itself was measured, the benchmark executable was compiled with no optimisations. The most significant speed-up comes from -O1 optimisation.
  • Figure 5: Event throughput for MadtRex reweighting, with the default MG5aMC reweighting module shown for comparison. Throughputs and standard deviations were calculated from mean runtimes for event samples of varying size (from 10 to $10^7$ events) and varying numbers of reweighted parameter sets (from 8 to 6435 iterations). GPU offloading has a clear advantage over on-host SIMD parallelism, which in turn is faster than scalar instructions, but even without any explicit data parallelism MadtRex is consistently $\sim 40$ times faster than MG5aMC reweighting.