A multi-event interface for next-to-leading order calculations in MadGraph5_aMC@NLO

Rikkert Frederix, Stefan Roiser, Robert Schöfbeck, Zenny Wettersten, Marco Zaro

TL;DR

This work introduces a multi-event interface to enable batched evaluation of tree-level amplitudes across multiple phase-space points within a single NLO cross-section computation in MadGraph5_aMC@NLO. A multithreaded OpenMP proof-of-concept demonstrates data-parallel evaluation of tree-level amplitudes and reproduces sequential results within numerical caveats. The authors show that tree-level amplitudes dominate runtime in NLO event generation, motivating data-parallel strategies and paving the way for on-CPU SIMD and SIMT GPU acceleration. They discuss algorithmic adjustments and overheads, address practical challenges in phase-space cuts and unweighting, and outline a path toward scalable hardware-accelerated NLO event generation.

Abstract

We detail the implementation of a multi-event interface for next-to-leading order (NLO) calculations in MadGraph5_aMC@NLO, allowing tree-level scattering amplitudes for multiple phase-space points to be evaluated in each call to the integrated NLO differential cross section during event generation. We also present a multithreaded implementation based on this interface, in which tree-level amplitudes are evaluated in parallel across multiple CPU threads, for the Monte Carlo generation of quantum chromodynamics (QCD) events. Although this work primarily concerns the implemented code, it includes some algorithmic changes to the order in which phase-space cuts are applied and in which the different scattering amplitudes are called. The codebase currently supports multithreaded execution, but these changes pave the way for further data parallelism in the form of on-CPU SIMD instructions or SIMT GPU offloading. A study of the runtime fraction spent in different diagrammatic contributions across various processes suggests that NLO QCD event generation is computationally dominated by tree-level scattering amplitude evaluations, which we show are well suited to data parallelisation.
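
The idea of a multi-event interface can be illustrated with a minimal Python sketch. It is not the MadGraph5_aMC@NLO code, which this page does not show: `toy_amplitude_squared`, `evaluate_sequential` and `evaluate_batched` are hypothetical names, and the "amplitude" is an arbitrary stand-in function with no physical meaning. The sketch only shows the shape of the interface: a batch of phase-space points goes in, the batch is split into contiguous chunks for a pool of threads, and the results come back in the original order so they can be compared with a sequential evaluation. Python threads do not speed up CPU-bound work of this kind, so the sketch says nothing about performance.

```python
from concurrent.futures import ThreadPoolExecutor
import math
import random


def toy_amplitude_squared(point):
    """Hypothetical stand-in for a tree-level |M|^2 (no physical meaning)."""
    x, y = point
    return 1.0 / (1.0 + x * x + y * y)


def evaluate_sequential(points):
    """Evaluate the toy amplitude one phase-space point at a time."""
    return [toy_amplitude_squared(p) for p in points]


def evaluate_batched(points, n_threads=4):
    """Evaluate a whole batch, one contiguous chunk per worker thread.

    Chunks are processed independently and the results are concatenated
    in the original order, so the output matches the sequential path.
    """
    chunk = max(1, math.ceil(len(points) / n_threads))
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial_results = pool.map(evaluate_sequential, chunks)
        return [value for part in partial_results for value in part]


# Usage: a batch of random (assumed two-dimensional) toy points.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(1000)]
assert evaluate_batched(points) == evaluate_sequential(points)
```

Because each point is evaluated independently of the others, the batched and sequential paths agree exactly in this toy case. The TL;DR notes that the real implementation reproduces sequential results only within numerical caveats.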

Paper Structure

This paper contains 9 sections, 4 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Runtime profile for one integration channel for the $g g \to t \overline{t}$ partonic contribution to the process $pp\to t \overline{t}$ in MG5aMC, using our multi-event interface, as a function of the number of concurrent threads. For these tests, a single call was made to the differential cross section subroutine sigintF with 65 536 randomly generated phase-space points to be evaluated through the multi-event interface, which greatly exaggerates the sequential overhead shown in the left-hand plot. The right-hand plot shows the real (wall) runtimes spent in scattering amplitude evaluations alongside the predicted runtimes, given by the single-threaded runtime divided by the number of concurrent threads. Displayed measurements are mean values of five independent runs, with standard deviations denoted by error bars. All tests were run on an AMD Ryzen 7 PRO 8840U.
  • Figure 2: Runtime profile for one integration channel for the $g g \to t \overline{t}$ partonic contribution to the process $pp\to t \overline{t} \, j$ in MG5aMC, using our multi-event interface, as a function of the number of concurrent threads. For these tests, a single call was made to the differential cross section subroutine sigintF with 65 536 randomly generated phase-space points to be evaluated through the multi-event interface, which greatly exaggerates the sequential overhead shown in the left-hand plot. The right-hand plot shows the real (wall) runtimes spent in scattering amplitude evaluations alongside the predicted runtimes, given by the single-threaded runtime divided by the number of concurrent threads. Displayed measurements are mean values of five independent runs, with standard deviations denoted by error bars. All tests were run on an AMD Ryzen 7 PRO 8840U.
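
Both captions define the predicted runtime as the single-threaded runtime divided by the number of concurrent threads, which is the ideal-scaling baseline. The short sketch below shows how such a baseline, and a parallel efficiency derived from it, could be computed. The function names are hypothetical, and the numbers in the usage lines are illustrative placeholders, not measurements from the paper.

```python
def predicted_runtime(single_thread_seconds, n_threads):
    """Ideal-scaling prediction: single-threaded runtime / thread count."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return single_thread_seconds / n_threads


def parallel_efficiency(single_thread_seconds, measured_seconds, n_threads):
    """Ratio of predicted to measured wall time (1.0 means ideal scaling)."""
    return predicted_runtime(single_thread_seconds, n_threads) / measured_seconds


# Placeholder values for illustration only (not data from the paper).
single_thread = 8.0  # seconds, assumed
for n in (1, 2, 4, 8):
    print(n, predicted_runtime(single_thread, n))
```

An efficiency below 1.0 for a measured point would indicate that the wall time exceeds the ideal prediction, for instance because of sequential overhead of the kind the captions mention.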