Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems

Jeremy J. Williams, Jordy Trilaksono, Stefan Costea, Yi Ju, Luca Pennati, Jonah Ekelund, David Tskhakaya, Leon Kos, Ales Podolnik, Jakub Hromadka, Allen D. Malony, Sameer Shende, Tilman Dannert, Frank Jenko, Erwin Laure, Stefano Markidis

Abstract

Particle-in-Cell (PIC) Monte Carlo (MC) simulations are central to plasma physics but face increasing challenges on heterogeneous HPC systems due to excessive data movement, synchronization overheads, and inefficient utilization of multiple accelerators. In this work, we present a portable, multi-GPU hybrid MPI+OpenMP implementation of BIT1 that enables scalable execution on both NVIDIA and AMD accelerators through OpenMP target tasks with explicit dependencies to overlap computation and communication across devices. Portability is achieved through persistent device-resident memory, an optimized contiguous one-dimensional data layout, and a transition from unified to pinned host memory to improve large data-transfer efficiency, together with GPU Direct Memory Access (DMA) and runtime interoperability for direct device-pointer access. Standardized and scalable I/O is provided using openPMD and ADIOS2, supporting high-performance file I/O, in-memory data streaming, and in-situ analysis and visualization. Performance results on pre-exascale and exascale systems, including Frontier (OLCF-5) with up to 16,000 GPUs, demonstrate significant improvements in run time, scalability, and resource utilization for large-scale PIC MC simulations.
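
To make the overlap scheme concrete, below is a minimal C sketch of OpenMP target tasks with explicit dependencies distributing a particle push over multiple GPUs. It is an illustration under assumptions, not BIT1's actual source: the push kernel is reduced to a toy update, and the function name, array names, and chunking scheme are hypothetical.

#include <omp.h>

/* Hypothetical sketch: push n particles split across all visible GPUs.
   Each target region becomes a deferred task ('nowait') with an explicit
   dependency ('depend'), so kernels and transfers on different devices
   can overlap. */
void push_particles(double *x, double *v, const double *E, long n)
{
    int ndev = omp_get_num_devices();
    if (ndev == 0) return;              /* no accelerators available */
    long chunk = n / ndev;

    for (int dev = 0; dev < ndev; dev++) {
        long lo  = dev * chunk;
        long len = (dev == ndev - 1) ? n - lo : chunk;

        #pragma omp target teams distribute parallel for \
                device(dev) nowait depend(out: x[lo]) \
                map(tofrom: x[lo:len], v[lo:len]) map(to: E[lo:len])
        for (long i = lo; i < lo + len; i++) {
            v[i] += E[i];               /* toy field push */
            x[i] += v[i];               /* toy drift */
        }
    }

    /* Synchronize all device tasks before the MPI particle exchange
       (exchange itself omitted). */
    #pragma omp taskwait
}

In the implementation described above, the per-call map clauses would instead be persistent device-resident allocations (e.g., via target enter data) so particle arrays remain on the GPU across time steps, with pinned host buffers used for the remaining large transfers.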

Paper Structure

This paper contains 14 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: A diagram of the PIC method on HPC architectures. After initialization, the PIC method repeats at each time step. In gray, we highlight the particle mover step that we parallelize in the portable multi-GPU hybrid BIT1.
  • Figure 2: Ionization case function percentage breakdown (using gprof) on Dardel, showing where most of the execution time is spent for Original BIT1, openPMD BP4, and openPMD SST simulations [williams2023leveraging, williams2024understanding, williams2026integrating]. The arrj sorting function (yellow) dominates but drops from 75.5% (Original BIT1) to 65.5% (BP4) and 35.5% (SST).
  • Figure 3: Hybrid BIT1 (Ionization Case) Total Simulation (Development Progression) strong scaling on 1 Node (4 MPI ranks & 4 GPUs) on MN5 ACC for 2K time steps.
  • Figure 4: Hybrid BIT1 (Sheath) Total Simulation (Relative) Speedup (left) and Parallel Efficiency (PE) (right) - Strong and Weak Scaling up to 100 Nodes (up to 800 GPUs) on MN5 ACC, LUMI-G, and Frontier for 2K time steps.
  • Figure 5: Hybrid BIT1 (Sheath) Total Simulation (Relative) Speedup (left) and Parallel Efficiency (PE) (right) - Strong and Weak Scaling up to 2,000 Nodes (up to 16,000 GPUs) on Frontier for 10K time steps.
  • ...and 2 more figures