Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

John Tramm, Paul Romano, Patrick Shriwise, Amanda Lund, Johannes Doerfert, Patrick Steinbrecher, Andrew Siegel, Gavin Ridley

TL;DR

OpenMC's GPU port, which uses OpenMP target offloading, is evaluated for performance portability across AMD, NVIDIA, and Intel GPUs on the Frontier, Polaris, and Aurora supercomputers. The study analyzes event-based GPU parallelism, sorting, and other optimizations, and benchmarks the GPU port against the CPU version of OpenMC and other CPU-based Monte Carlo (MC) codes. Results show robust cross-vendor performance, including excellent weak scaling and high per-node throughput, with Intel Ponte Vecchio GPUs delivering leading performance on depleted-fuel small modular reactor (SMR) problems. The work demonstrates the viability of portable GPU-based Monte Carlo simulation at exascale and highlights the potential of OpenMP offloading for production HPC codes.

Abstract

OpenMC is an open-source Monte Carlo neutral particle transport application that has recently been ported to GPUs using the OpenMP target offloading model. We examine the performance of OpenMC at scale on the Frontier, Polaris, and Aurora supercomputers, demonstrating that OpenMC has achieved performance portability across all three major GPU vendors (AMD, NVIDIA, and Intel). OpenMC's GPU performance is compared with both the traditional CPU-based version of OpenMC and several other state-of-the-art CPU-based Monte Carlo particle transport applications. We also provide historical context by analyzing OpenMC's performance on several legacy GPU and CPU architectures. This work includes some of the first published results for a scientific simulation application at scale on a supercomputer featuring Intel's Max series "Ponte Vecchio" GPUs. It is also one of the first demonstrations of a large scientific production application using the OpenMP target offloading model to achieve high performance on all three major GPU platforms.

Paper Structure

This paper contains 17 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Comparison between other state-of-the-art Monte Carlo particle transport codes and OpenMC on a depleted pincell benchmark problem. Performance is measured for inactive batches.
  • Figure 2: Performance comparison of OpenMC on various node architectures. Performance is given in terms of active batch (i.e., with tally) particle rates on a depleted small modular reactor problem with 195k material regions.
  • Figure 3: Weak scaling (where the problem size per GPU remains constant) performance of OpenMC on various architectures. Performance is given in terms of active batch (i.e., with tally) performance on a depleted small modular reactor problem with 195k material regions.
  • Figure 4: Performance of OpenMC on legacy CPU and GPU architectures by release date for a depleted pincell benchmark problem.
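
The Figure 3 caption defines weak scaling as holding the problem size per GPU constant. One common way to summarize such a run is weak-scaling efficiency: the measured aggregate particle rate divided by the ideal rate, which grows linearly with GPU count. The sketch below illustrates that calculation only. The function name `weak_scaling_efficiency` and all numbers are hypothetical and chosen for illustration; they are not results from the paper, and the paper may define or report efficiency differently.

```python
def weak_scaling_efficiency(gpu_counts, particle_rates):
    """Return the efficiency at each GPU count relative to the first run.

    Under weak scaling the work per GPU is fixed, so the ideal aggregate
    particle rate grows linearly with the number of GPUs.
    """
    if not gpu_counts or len(gpu_counts) != len(particle_rates):
        raise ValueError("need equal-length, non-empty inputs")
    # Per-GPU rate of the baseline (first) run defines the ideal line.
    per_gpu_base = particle_rates[0] / gpu_counts[0]
    return [
        rate / (n * per_gpu_base)
        for n, rate in zip(gpu_counts, particle_rates)
    ]


# Hypothetical numbers for illustration only; not results from the paper.
gpus = [1, 8, 64]
rates = [1.0e6, 7.8e6, 60.8e6]  # aggregate particles per second
efficiencies = weak_scaling_efficiency(gpus, rates)

for n, eff in zip(gpus, efficiencies):
    print(f"{n:>3} GPUs: {eff:.1%} of ideal")
```

An efficiency near 1.0 at every GPU count means the aggregate rate tracks the ideal line, which is the behavior a good weak-scaling result describes.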