An NLO-Matched Initial and Final State Parton Shower on a GPU

Michael H. Seymour, Siddharth Sule

TL;DR

The paper advances GPU-accelerated Monte Carlo event generation by presenting GAPS version 2, an NLO-matched initial- and final-state parton shower implemented on GPUs and accompanied by a near-identical C++ generator for CPUs. It introduces non-algorithmic computational improvements, namely partitioning of finished events and kernel tuning, that substantially reduce run time: $10^6$ NLO events take around 60 seconds on a V100, with speed and energy consumption on par with a 96-core CPU cluster. Physics validation shows good agreement for $pp \to Z$ observables against Herwig and related NLO tools, confirming correct matching and shower behavior. The work demonstrates the practical viability of GPU-based event generation and outlines clear paths for further performance gains, including 2D kernels, and for adding hadronisation and MPI to build a full GPU Event Generator.

Abstract

Recent developments have demonstrated the potential for high simulation speeds and reduced energy consumption by porting Monte Carlo Event Generators to GPUs. We release version 2 of the CUDA C++ parton shower event generator GAPS, which can simulate initial and final state emissions on a GPU and is capable of hard-process matching. As before, we accompany the generator with a near-identical C++ generator to run simulations on single-core and multi-core CPUs. Using these programs, we simulate NLO Z production at the LHC and demonstrate that the speed and energy consumption of an NVIDIA V100 GPU are on par with a 96-core cluster composed of two Intel Xeon Gold 5220R Processors, providing a potential alternative to cluster computing.

Paper Structure

This paper contains 26 sections, 66 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The updated parallelised veto algorithm. The PDF evaluations are required for calculating the emission acceptance probability and are therefore performed before that step.
  • Figure 2: Partitioning the event record list. In this case, $N$ events begin showering. After $N/2$ have finished, the events are partitioned into unfinished events followed by finished events; this is automated using thrust::partition from the CUDA Thrust library [nvidia-thrust-2025]. After this partitioning, the kernel is launched with $N/2$ threads, leaving the finished events unaffected. In practice, several events may finish showering at the same step, so the kernels are launched with $N-N_{\mathrm{finished}}$ threads. This version allows us to partition the list of event records at any point in the simulation.
  • Figure 3: $Z$ observables and anti-$k_T$ jets produced with $R=0.4$. The $Z$ and lepton observables are fully inclusive, while each jet's $p_T$ distribution is shown when its $|\eta|<5$, its $\eta$ distribution is shown when its $p_T>5$ GeV, and the $\Delta R$ and multiplicity distributions are shown when $p_T>5$ GeV and $|\eta|<5$. The $Z$ and lepton observables agree very well with Herwig, the leading-jet observables reasonably well, and the second- and third-jet observables slightly less well.
  • Figure 4: NLO+Shower for the process $p p \to Z$, where the $Z$ boson is on-shell and stable. As in the LO+Shower case, the $Z$ boson observables are in agreement. The jet observables show the same deviations as in the LO+Shower case and are omitted here.
  • Figure 5: Kernel Tuning Results, with partitioning on and off. For small $N_{EV}$, there is negligible improvement due to the partitioning, and the partitioning even increases the execution time. However, for larger $N_{EV}$ it starts reducing the execution time, offering a $20\%$ reduction in execution time for 1,000,000 events. In terms of kernel tuning, execution times increase slightly with $N_T$ for 10,000 and 100,000 events, but decrease slightly with $N_T$ for 1,000,000 events. The best choice overall is to use $N_T = 128$.
  • ...and 2 more figures
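
The partitioning step described in the Figure 2 caption, together with the threads-per-block choice from Figure 5, can be sketched on the CPU side as follows. This is a minimal illustration under stated assumptions, not the GAPS implementation: the `Event` struct and the helper names `partition_unfinished` and `blocks_needed` are hypothetical, and `std::partition` is used as a CPU stand-in for `thrust::partition`, which plays the same role on device data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal event record: only the flag needed for partitioning.
struct Event {
    int id;
    bool finished;  // true once the shower has terminated for this event
};

// Reorder the list so that unfinished events come first, followed by
// finished ones, and return the number of unfinished events. This count
// corresponds to N - N_finished, the number of threads the next kernel
// launch would need.
std::size_t partition_unfinished(std::vector<Event>& events) {
    auto first_finished = std::partition(
        events.begin(), events.end(),
        [](const Event& e) { return !e.finished; });
    return static_cast<std::size_t>(first_finished - events.begin());
}

// Number of thread blocks needed to cover n_active events with n_threads
// threads per block (ceiling division), as in a 1D kernel launch.
std::size_t blocks_needed(std::size_t n_active, std::size_t n_threads) {
    return (n_active + n_threads - 1) / n_threads;
}
```

Because finished events sit at the tail of the list after partitioning, a launch that covers only the first `partition_unfinished(events)` entries leaves them untouched. With the caption's preferred $N_T = 128$, `blocks_needed(n_active, 128)` gives the matching block count for such a launch.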