Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming

Jeremy J. Williams, Felix Liu, Jordy Trilaksono, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, Pratibha Hegde, Marta Garcia-Gasulla, Valentin Seitz, Frank Jenko, Erwin Laure, Stefano Markidis

TL;DR

This work advances BIT1, a 1D3V PIC Monte Carlo code for divertor plasmas, by integrating MPI with OpenMP and OpenACC and by introducing asynchronous multi-GPU programming to accelerate the particle mover. It develops and evaluates MPI+OpenMP and MPI+OpenACC hybrids, GPU offloading, and asynchronous multi-GPU strategies using OpenMP target tasks (nowait, depend) and OpenACC async(n) across multiple leadership-class systems. The gains are substantial but system-dependent: at extreme scales, the OpenMP asynchronous multi-GPU version achieves higher speedup and parallel efficiency than the OpenACC version, with notable improvements in both mover time and total runtime, although data movement and inter-node communication remain bottlenecks. On MareNostrum 5 (MN5), the asynchronous approaches deliver strong performance gains and better overlap. Overall, the study demonstrates that asynchronous multi-GPU programming can significantly boost throughput and GPU utilization for large-scale PIC simulations, moving BIT1 closer to the needs of exascale fusion-plasma research.

Abstract

As fusion energy devices advance, plasma simulations are crucial for reactor design. Our work extends BIT1's hybrid parallelization by integrating MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming. Results show significant performance gains: 16 MPI ranks plus OpenMP threads reduced runtime by 53% on a petascale EuroHPC supercomputer, while OpenACC multicore achieved a 58% reduction. At 64 MPI ranks, OpenACC outperformed OpenMP, improving the particle mover function by 24%. On MareNostrum 5, OpenACC async(n) delivered strong performance, but the OpenMP asynchronous multi-GPU approach proved more effective at extreme scale, maintaining efficiency up to 400 GPUs. Speedup and parallel efficiency (PE) studies showed the OpenMP asynchronous multi-GPU version achieving an 8.77x speedup (54.81% PE), surpassing OpenACC (8.14x speedup, 50.87% PE). While PE declined at high node counts due to communication overhead, asynchronous execution mitigated scalability bottlenecks. The OpenMP nowait and depend clauses improved GPU performance through efficient data transfer and task management. Using NVIDIA Nsight tools, we confirmed BIT1's efficiency for large-scale plasma simulations. The OpenMP asynchronous multi-GPU implementation delivered portability, high throughput, and high GPU utilization, positioning BIT1 for exascale supercomputing and advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.
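
The speedup and PE figures quoted in the abstract can be related with a short calculation. The sketch below assumes the common definition of parallel efficiency as speedup divided by the factor by which resources were scaled. The factor of 16 is not stated in the abstract; it is inferred here because it reproduces the reported percentages. The function and variable names are illustrative only.

```python
def parallel_efficiency(speedup: float, scale_factor: float) -> float:
    """Return parallel efficiency in percent, defined as speedup / scale_factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return 100.0 * speedup / scale_factor


# Assumption: resources grew 16x relative to the baseline run. This value is
# inferred from the reported numbers, not taken from the abstract.
SCALE_FACTOR = 16

openmp_pe = parallel_efficiency(8.77, SCALE_FACTOR)   # OpenMP async multi-GPU
openacc_pe = parallel_efficiency(8.14, SCALE_FACTOR)  # OpenACC async(n)

print(f"OpenMP async multi-GPU PE: {openmp_pe:.2f}%")
print(f"OpenACC async(n) PE:       {openacc_pe:.2f}%")
```

Under this assumption both values agree with the reported 54.81% and 50.87% to within rounding of the quoted speedups.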

Paper Structure

This paper contains 15 sections, 27 figures, 1 table.

Figures (27)

  • Figure 1: BIT1 simulates plasma behavior in the tokamak divertor region (blue arrow), such as in the ITER fusion device.
  • Figure 2: A diagram representing the algorithm used in BIT1. After the initialization the PIC algorithm cycle is repeated at each time step. In orange, we highlight the particle mover step that we parallelize with OpenMP and OpenACC.
  • Figure 3: A simple diagram showing two neighboring particles in conventional PIC codes (a) and in BIT1's new approach (b), where in (a), particles that are neighbors in space may not be adjacent in memory, whereas in (b), particles that are neighbors in space are also adjacent in memory and organized according to spatial cells.
  • Figure 4: Hybrid BIT1 total simulation and optimized mover function using 2 and 16 ranks per node on NJ for 1000 time steps.
  • Figure 5: Hybrid BIT1 total simulation and optimized mover function using 16 and 64 ranks per node on Vega for 20000 time steps.
  • ...and 22 more figures