Madgraph on GPUs and vector CPUs: towards production (The 5-year journey to the first LO release CUDACPP v1.00.00)
Andrea Valassi, Taylor Childers, Stephan Hageböck, Daniele Massaro, Olivier Mattelaer, Nathan Nichols, Filip Optolowicz, Stefan Roiser, Jørgen Teig, Zenny Wettersten
TL;DR
This paper reports on porting the leading-order (LO) event generation of MadGraph5_aMC@NLO to GPUs and vector CPUs via the CUDACPP plugin, with the aim of reducing compute costs in HL-LHC workflows. It documents the architectural evolution from a sequential CPU pipeline to a data-parallel design in which a matrix-element (ME) kernel processes many events at once, using CUDA/C++ on GPUs and SIMD on CPUs. The authors present functional, performance, and integration tests, including feedback from CMS, and a packaging strategy that links two repositories via tarballs and submodules. Key results show LO ME speedups of up to ~180x on GPUs and ~16x on CPUs; by Amdahl's law, gains for the total workflow are limited by the non-ME bottlenecks. The work lays the groundwork for extending to NLO and to other backends (AMD HIP, SYCL), and outlines ongoing efforts to optimize phase-space sampling and PDF evaluations for production use.
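The Amdahl's law limit mentioned above can be illustrated with a minimal sketch. The function name `amdahl_speedup` and the ME runtime fractions used here are hypothetical and chosen only for illustration; they are not measurements from the paper. Only the ~180x ME speedup figure is taken from the text.

```python
def amdahl_speedup(me_fraction: float, me_speedup: float) -> float:
    """Overall speedup when only the ME fraction of the runtime is accelerated.

    me_fraction: share of the original runtime spent in the ME calculation (0..1).
    me_speedup:  speedup factor achieved for the ME calculation alone.
    """
    if not 0.0 <= me_fraction <= 1.0:
        raise ValueError("me_fraction must be in [0, 1]")
    if me_speedup <= 0.0:
        raise ValueError("me_speedup must be positive")
    # Amdahl's law: the non-ME part (1 - me_fraction) runs at the original speed.
    return 1.0 / ((1.0 - me_fraction) + me_fraction / me_speedup)


if __name__ == "__main__":
    # Hypothetical ME fractions (assumptions, not results from the paper).
    for fraction in (0.50, 0.90, 0.99):
        overall = amdahl_speedup(fraction, 180.0)
        print(f"ME fraction {fraction:.2f}: overall speedup {overall:.1f}x")
```

Whatever the ME speedup, the overall gain cannot exceed 1 / (1 - me_fraction): with an assumed ME fraction of 0.90, for example, the total workflow speedup stays below 10x.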
Abstract
The effort to speed up the MadGraph5_aMC@NLO generator by exploiting CPU vectorization and GPUs, which started at the beginning of 2020, delivered the first production release of the code for leading-order (LO) processes in October 2024. To achieve this goal, many new features, tests and fixes have been implemented in recent months. This process also benefited from early feedback from the CMS experiment. In this contribution, we report on these activities and on the status of the LO software at the time of CHEP2024.
