Madgraph on GPUs and vector CPUs: towards production (The 5-year journey to the first LO release CUDACPP v1.00.00)

Andrea Valassi, Taylor Childers, Stephan Hageböck, Daniele Massaro, Olivier Mattelaer, Nathan Nichols, Filip Optolowicz, Stefan Roiser, Jørgen Teig, Zenny Wettersten

TL;DR

This paper reports on porting MadGraph5_aMC@NLO's LO event generation to GPUs and vector CPUs via the CUDACPP plugin, aiming to reduce compute costs in HL-LHC workflows. It documents architectural evolution from a sequential CPU pipeline to a data-parallel, multi-event ME kernel design that runs on CUDA/C++ for GPUs and SIMD on CPUs. The authors present thorough functional, performance, and integration testing, including CMS feedback and a packaging strategy that links two repositories via tarballs and submodules. Key results show LO ME speedups of up to ~180x on GPUs and ~16x on CPUs, with total workflow gains limited by non-ME bottlenecks per Amdahl's law. The work lays groundwork for extending to NLO and other backends (AMD HIP, SYCL) and outlines ongoing efforts to optimize phase-space sampling and PDF evaluations for production use.
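
To see why the overall workflow gain stays well below the ME-only speedup, Amdahl's law can be worked through numerically. The sketch below is illustrative only: the function name `amdahl_speedup` and the 90% ME share of the runtime are assumptions made for this example, not values from the paper. Only the ~16x and ~180x ME speedups come from the summary above.

```python
def amdahl_speedup(me_fraction: float, me_speedup: float) -> float:
    """Overall workflow speedup when only the ME part is accelerated.

    me_fraction: share of the original runtime spent in matrix elements (0..1).
    me_speedup:  factor by which the ME part alone gets faster (> 0).
    """
    if not 0.0 <= me_fraction <= 1.0:
        raise ValueError("me_fraction must be between 0 and 1")
    if me_speedup <= 0.0:
        raise ValueError("me_speedup must be positive")
    # The non-ME part is unchanged; the ME part shrinks by me_speedup.
    return 1.0 / ((1.0 - me_fraction) + me_fraction / me_speedup)


# Hypothetical ME share of the runtime (NOT a number from the paper).
ASSUMED_ME_FRACTION = 0.90

cpu_total = amdahl_speedup(ASSUMED_ME_FRACTION, 16.0)   # ME speedup quoted for vector CPUs
gpu_total = amdahl_speedup(ASSUMED_ME_FRACTION, 180.0)  # ME speedup quoted for GPUs
ceiling = 1.0 / (1.0 - ASSUMED_ME_FRACTION)             # limit for infinitely fast MEs

print(f"CPU: {cpu_total:.2f}x, GPU: {gpu_total:.2f}x, ceiling: {ceiling:.2f}x")
```

Under this assumed 90% share, a 16x ME speedup yields about 6.4x overall and a 180x ME speedup about 9.5x, against a ceiling of 10x. Once the MEs are fast, the remaining non-ME work dominates the runtime.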

Abstract

The effort to speed up the MadGraph5_aMC@NLO generator by exploiting CPU vectorization and GPUs, which started at the beginning of 2020, delivered the first production release of the code for leading-order (LO) processes in October 2024. To achieve this goal, many new features, tests and fixes have been implemented in recent months. This process also benefited from early feedback from the CMS experiment. In this contribution, we report on these activities and on the status of the LO software at the time of CHEP2024.

Paper Structure

This paper contains 3 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Schematic representation of the architectural evolution of MG5aMC. The main difference between the old Fortran-only version (top left, pink background) and those based on CUDACPP (light blue background) is that the former uses a sequential single-event API for the calculation of matrix elements, while the latter uses a data-parallel multi-event API. Additional details on the evolution of the work on the CUDACPP plugin between 2020 and 2024 are provided in the text. These plots, presented at CHEP2024 [chep2024slides], are derived from those presented in a 2020 talk to the HSF generator WG [hsf2020].
  • Figure 2: Breakdown of the ME and various non-ME contributions to the overall runtime of a DY+3jets ($pp\!\rightarrow\! \ell^+\!\ell^-\!jjj$) gridpack, using Fortran MEs (left) or CUDACPP "512z" C++ MEs (right). Gridpack launched on CERN itgold91 with Intel Gold 6326 CPUs, using gcc11.4 builds. The numbers refer to the generation of 100 unweighted events: this involved the execution of 108 madevent applications, each processing 16384 ME calculations, for an overall total of 1.8M ME calculations on weighted events.
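
The event counts quoted in the Figure 2 caption can be cross-checked with simple arithmetic. The snippet below is a sanity check only: the variable names are chosen for this example, and the number of ME calculations per unweighted event is a derived quantity, not a figure stated in the caption.

```python
# Numbers taken from the Figure 2 caption.
N_MADEVENT_APPS = 108        # madevent executions in the gridpack run
ME_CALCS_PER_APP = 16384     # ME calculations processed by each execution
N_UNWEIGHTED_EVENTS = 100    # unweighted events generated

total_me_calcs = N_MADEVENT_APPS * ME_CALCS_PER_APP
calcs_per_unweighted = total_me_calcs / N_UNWEIGHTED_EVENTS

print(f"total ME calculations: {total_me_calcs:,} (~{total_me_calcs / 1e6:.1f}M)")
print(f"ME calculations per unweighted event: {calcs_per_unweighted:,.0f}")
```

The product is 1,769,472, consistent with the 1.8M quoted in the caption, which corresponds to roughly 17,700 ME calculations on weighted events for each unweighted event in this run.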