New GPU developments in the Madgraph CUDACPP plugin: kernel splitting, helicity streams, cuBLAS color sums

Andrea Valassi

TL;DR

This work addresses the GPU/SIMD bottleneck in LO matrix-element calculations within MG5aMC by introducing kernel splitting in the CUDACPP plugin. It first splits the monolithic sigmaKin kernel into helicity streams and a separate color-sum path, with optional BLAS offload, and then extends to partitioning Feynman diagrams into groups or device functions, enabling computations for highly gluon-rich final states such as $gg\rightarrow t\bar{t}gggg$ ($2\rightarrow6$) and $gg\rightarrow t\bar{t}ggggg$ ($2\rightarrow7$). The results show substantial throughput gains on NVidia GPUs for complex processes, with manageable impacts on simpler processes, and demonstrate new capabilities on CPUs and AMD GPUs, while preserving physics accuracy. The work also documents previously undocumented features of CUDACPP and discusses production-release implications, suggesting upstream integration and future directions for further performance and scalability improvements in MG5aMC.

Abstract

The first production release of the CUDACPP plugin for the Madgraph5_aMC@NLO generator, which speeds up matrix element (ME) calculations for leading-order (LO) processes using a data parallel approach on vector CPUs and GPUs, was delivered in October 2024. This was described in previous publications by the team behind that effort. In this paper, I describe my work on some additional developments and optimizations of CUDACPP, mainly but not exclusively for GPUs. The new approach, which represents a major restructuring of the CUDACPP computational engine, primarily consists in splitting the ME calculation, previously performed using a single large GPU kernel, into many smaller kernels. A first batch of changes, involving the move to separate "helicity streams" and the optional offloading of QCD color sums to BLAS, was recently merged into a new CUDACPP release, in collaboration with my colleagues. Since then, I have completed a second batch of changes, involving the possibility to split the calculation into groups of Feynman diagrams in separate source code files. This new feature makes it possible to compute QCD matrix elements for physics processes with a larger number of final state gluons: in particular, I present the first performance results from CUDACPP for the $2\!\rightarrow\!6$ process $gg\!\rightarrow\!t\bar{t}gggg$ on CPUs and GPUs and the $2\!\rightarrow\!7$ process $gg\!\rightarrow\!t\bar{t}ggggg$ on CPUs, which involve over 15k and 230k Feynman diagrams, respectively. I also take this opportunity to describe in detail some previously undocumented features of the CUDACPP software, both in the GPU and vector CPU implementations.
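As a rough illustration of why the QCD color sum lends itself to a separate kernel or a BLAS call, the sketch below writes the contribution of one helicity as a quadratic form, $\sum_{ij} J_i^* C_{ij} J_j$, over color-ordered partial amplitudes $J$ with a real color matrix $C$, and then sums over helicities. It is a minimal CPU-only sketch, not the CUDACPP implementation: the function names, the row-major layout and the omission of normalization factors are all assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cxd = std::complex<double>;

// Hypothetical helper (not a CUDACPP function): color sum for ONE helicity,
// i.e. sum_ij conj(J_i) * C_ij * J_j, with C a real ncolor x ncolor matrix
// stored row-major. Normalization factors are deliberately left out.
double colorSumOneHelicity(const std::vector<cxd>& jamp,
                           const std::vector<double>& colorMatrix)
{
  const std::size_t ncolor = jamp.size();
  assert(colorMatrix.size() == ncolor * ncolor);
  double me = 0.0;
  for (std::size_t i = 0; i < ncolor; ++i)
  {
    // Row i of the matrix-vector product C * J
    cxd ztemp(0.0, 0.0);
    for (std::size_t j = 0; j < ncolor; ++j)
      ztemp += colorMatrix[i * ncolor + j] * jamp[j];
    // For a symmetric real C the full sum is real; keep the real part only
    me += (std::conj(jamp[i]) * ztemp).real();
  }
  return me;
}

// Hypothetical helper: add up the color sums of all helicity combinations.
// In a split-kernel design each helicity could fill its own jamp buffer
// independently (e.g. in its own stream) before this step.
double sumOverHelicities(const std::vector<std::vector<cxd>>& jampsPerHelicity,
                         const std::vector<double>& colorMatrix)
{
  double total = 0.0;
  for (const auto& jamp : jampsPerHelicity)
    total += colorSumOneHelicity(jamp, colorMatrix);
  return total;
}
```

For a batch of events, the inner product $C \cdot J$ becomes a matrix-matrix multiplication over the event index, which is the kind of operation that BLAS libraries are designed for.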

Paper Structure

This paper contains 25 sections, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Schematic representation of the architectural evolution of matrix element calculations in the MG5aMC CUDACPP plugin, in the context of the kernel splitting developments presented in this paper. The steps above the blue dotted line, up to ihel3 and ihel4 included, have been presented in the first preprint bib:preprintV1 of this paper, while those below it are only described here. The steps in the green boxes (up to ihel3p1 included) have been merged upstream and included in production releases of CUDACPP; the steps in the red boxes (from ihel4 to ihel6p2 included) and the step in the orange box (csm), conversely, are included in two pull requests that I recommend for merging upstream. More details about each step are provided in the text and in other figures.
  • Figure 2: Schematic representation of the architectural evolution of the work on the MG5aMC CUDACPP plugin between 2020 and 2024. This plot was prepared for CHEP2024 and is taken as-is from its proceedings bib:chep2024, where additional details can be found.
  • Figure 3: Schematic representation of the CUDACPP engine for computing MEs (sigmaKin), and of its evolution through the first four scenarios described in this paper: (ihel0) current version before kernel splitting; (ihel1) helicity streams; (ihel2) color sum kernel; (ihel3b) color sum on BLAS via host dispatcher. For the ihel3 software, only the (non-default) case with BLAS enabled at runtime is illustrated: by default, the ihel3 software has BLAS disabled at runtime, which is essentially the same as what is shown for the ihel2 scenario (the only difference is that in the ihel3 scenario the kernel is named color_sum_kernel and is invoked by a color_sum_gpu host function, which could also dispatch the calculation to the color_sum_blas BLAS host function).
  • Figure 4: Throughputs (ME/s) as a function of grid size for an NVidia V100 GPU at CERN, on a node equipped with Intel Xeon Silver 4216 CPUs. Code built using CUDA 12.0 and gcc 11.5. Higher is better. The 12 plots correspond to 4 physics processes in 3 floating point precisions. The number of threads per block is fixed to 32 (NVidia GPU warp size); the grid size is varied by changing the number of blocks. Each plot compares different scenarios considered in this paper: (ihel0) release v1.00.02, before kernel splitting; (ihel1) helicity streams; (ihel3) color sum kernel; (ihel3b) cuBLAS color sum; (ihel3p1) release v1.01.01, color sum kernel; (ihel3b) release v1.01.01, cuBLAS color sum; (ihel4) Feynman diagrams as individual kernels; (ihel6p2) diagram groups, color sum kernel; (ihel6p2b) diagram groups, cuBLAS color sum. For ihel6p2 and ihel6p2b, all four processes were generated with a single diagram group, executed as a kernel at runtime (DCDIAG=0), without graphs.
  • Figure 5: Throughputs (ME/s) as a function of grid size for an AMD MI200 GPU at LUMI, on a node equipped with AMD EPYC 7A53 CPUs. Code built using ROCm 6.0 and gcc 13.2. Higher is better. The 9 plots correspond to 3 physics processes in 3 floating point precisions. The number of threads per block is fixed to 256; the grid size is varied by changing the number of blocks. Each plot compares different scenarios considered in this paper: (ihel0) release v1.00.02, before kernel splitting; (ihel1) helicity streams; (ihel3) color sum kernel; (ihel3b) hipBLAS color sum; (ihel3p1) release v1.01.01, color sum kernel; (ihel3b) release v1.01.01, hipBLAS color sum; (ihel4) Feynman diagrams as individual kernels; (ihel6p2) diagram groups, color sum kernel; (ihel6p2b) diagram groups, hipBLAS color sum. For ihel6p2 and ihel6p2b, all four processes were generated with a single diagram group, which was executed as a kernel at runtime (DCDIAG=0), without graphs.
  • ...and 7 more figures
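
The grid-size scans described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the captions: that each GPU thread processes one event, so that the grid size equals the number of events per kernel launch, and that the number of blocks is scanned in powers of two. The names GridPoint, scanGridSizes and throughput are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one point on the x-axis of a throughput plot
struct GridPoint
{
  std::size_t blocks;          // number of GPU blocks
  std::size_t threadsPerBlock; // fixed (e.g. 32 on the V100, 256 on the MI200)
  std::size_t events;          // grid size = blocks * threadsPerBlock
};

// Build a scan with a fixed number of threads per block, varying only the
// number of blocks (ASSUMPTION: doubling at each step).
std::vector<GridPoint> scanGridSizes(std::size_t threadsPerBlock,
                                     std::size_t minBlocks,
                                     std::size_t maxBlocks)
{
  assert(threadsPerBlock > 0 && minBlocks > 0 && minBlocks <= maxBlocks);
  std::vector<GridPoint> points;
  for (std::size_t blocks = minBlocks; blocks <= maxBlocks; blocks *= 2)
    points.push_back({blocks, threadsPerBlock, blocks * threadsPerBlock});
  return points;
}

// Throughput in matrix elements per second (ME/s) for one measurement,
// ASSUMING one matrix element is computed per event
double throughput(std::size_t events, double seconds)
{
  assert(seconds > 0.0);
  return static_cast<double>(events) / seconds;
}
```

For example, scanGridSizes(32, 1, 8) yields grids of 32, 64, 128 and 256 events; timing the ME calculation at each point and calling throughput would give one curve of the kind shown in the plots.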