Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming
Jeremy J. Williams, Felix Liu, Jordy Trilaksono, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, Pratibha Hegde, Marta Garcia-Gasulla, Valentin Seitz, Frank Jenko, Erwin Laure, Stefano Markidis
TL;DR
This work advances BIT1, a 1D3V PIC Monte Carlo code for divertor plasmas, by combining MPI with OpenMP and OpenACC and by introducing asynchronous multi-GPU programming to accelerate the particle mover. It develops and evaluates MPI+OpenMP and MPI+OpenACC hybrids, GPU offloading, and asynchronous multi-GPU strategies based on OpenMP target tasks (nowait, depend) and OpenACC async(n), across multiple leadership-class systems. The gains are substantial but system-dependent: the OpenMP asynchronous multi-GPU version achieves higher speedup and parallel efficiency than the OpenACC version at extreme scale, with notable reductions in mover time and total runtime, although data movement and inter-node communication remain bottlenecks. On MareNostrum 5 (MN5), the asynchronous approaches deliver strong performance gains and better overlap. Overall, the study demonstrates that asynchronous multi-GPU programming can significantly boost throughput and GPU utilization for large-scale PIC simulations, moving BIT1 closer to exascale fusion-plasma research needs.
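To make the OpenMP target-task pattern named above concrete, the following is a minimal sketch and not BIT1's actual mover. The function name `push_particles_async`, the chunking scheme, the round-robin device assignment, and the toy update `x += v * dt` are all assumptions introduced for illustration. Each chunk of the particle array becomes a deferred target task (`nowait`), `depend` orders any tasks that touch the same chunk, and `taskwait` collects the results. Compiled without OpenMP support, the pragmas are ignored and the loop runs serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from BIT1): advance positions x by v * dt.
 * The array is split into num_chunks pieces; each piece is offloaded as a
 * deferred target task to a device chosen round-robin.
 * Assumption: num_devices does not exceed the number of available devices. */
void push_particles_async(double *x, const double *v, size_t n, double dt,
                          int num_devices, int num_chunks)
{
    assert(num_devices > 0 && num_chunks > 0);
    size_t chunk = (n + (size_t)num_chunks - 1) / (size_t)num_chunks;

    for (int c = 0; c < num_chunks; ++c) {
        size_t lo = (size_t)c * chunk;
        if (lo >= n) break;                       /* no particles left */
        size_t len = (lo + chunk <= n) ? chunk : n - lo;
        int dev = c % num_devices;                /* round-robin device choice */

        /* nowait: the host thread does not block on this chunk.
         * depend(inout: x[lo]): later tasks on the same chunk wait for it. */
        #pragma omp target teams distribute parallel for \
            device(dev) nowait \
            map(tofrom: x[lo:len]) map(to: v[lo:len]) \
            depend(inout: x[lo])
        for (size_t i = lo; i < lo + len; ++i) {
            x[i] += v[i] * dt;                    /* toy stand-in for the mover */
        }
    }

    /* Wait for all deferred target tasks before the host reads x again. */
    #pragma omp taskwait
}
```

An OpenACC version of the same idea would typically place each chunk on its own queue with `async(n)` and synchronize with a `wait` directive; that variant is not shown here.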
Abstract
As fusion energy devices advance, plasma simulations are crucial for reactor design. Our work extends BIT1's hybrid parallelization by integrating MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming. Results show significant performance gains: 16 MPI ranks plus OpenMP threads reduced runtime by 53% on a petascale EuroHPC supercomputer, while OpenACC multicore achieved a 58% reduction. At 64 MPI ranks, OpenACC outperformed OpenMP, improving the particle mover function by 24%. On MareNostrum 5, OpenACC async(n) delivered strong performance, but the OpenMP asynchronous multi-GPU approach proved more effective at extreme scale, maintaining efficiency up to 400 GPUs. Speedup and parallel efficiency (PE) studies showed the OpenMP asynchronous multi-GPU version achieving an 8.77x speedup (54.81% PE), surpassing OpenACC (8.14x speedup, 50.87% PE). While PE declined at high node counts due to communication overhead, asynchronous execution mitigated scalability bottlenecks. The OpenMP nowait and depend clauses improved GPU performance through efficient data transfer and task management. Using NVIDIA Nsight tools, we confirmed BIT1's efficiency for large-scale plasma simulations. The OpenMP asynchronous multi-GPU implementation delivered portability, high throughput, and high GPU utilization, positioning BIT1 for exascale supercomputing and advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.
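The speedup and PE pairs quoted above can be cross-checked with the usual definition of parallel efficiency, speedup divided by the factor by which resources were increased. The abstract does not state that factor; the value k = 16 below is an assumption inferred from the reported numbers, which it reproduces.

```latex
\mathrm{PE} = \frac{S}{k} \times 100\%, \qquad k = 16 \ \text{(assumed resource scale-up factor)}

\text{OpenMP async multi-GPU:}\quad \frac{8.77}{16} \times 100\% = 54.8125\% \approx 54.81\%

\text{OpenACC:}\quad \frac{8.14}{16} \times 100\% = 50.875\% \quad \text{(reported as } 50.87\%\text{)}
```

Under this assumption, ideal scaling would give a 16x speedup, so both versions retain a little over half of ideal performance at the largest scale reported.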
