Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations
Jeremy J. Williams, David Tskhakaya, Stefan Costea, Ivy B. Peng, Marta Garcia-Gasulla, Stefano Markidis
TL;DR
The paper characterizes the performance of BIT1, a Particle-in-Cell (PIC) Monte Carlo code for plasma simulations, on HPC systems. Using a profiling and tracing workflow, it identifies memory-bound behavior, with the arrj sorting function as the dominant bottleneck, and demonstrates strong scaling up to thousands of cores alongside notable MPI load imbalance and I/O bottlenecks. The study shows that optimizing data layout, memory hierarchy utilization, and parallel I/O is essential for performance gains, while porting BIT1 to GPUs would require substantial algorithmic reformulation to increase computational intensity. The findings provide practical guidance for optimizing large-scale PIC/MC codes and highlight strategies such as high-bandwidth memory usage, MPI de-synchronization, and in-situ analysis to improve throughput in fusion-relevant plasma simulations.
Abstract
Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and for modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of Monte Carlo collision models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases using several HPC profiling tools: perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance of the BIT1 sorting function is the main bottleneck. Strong scaling tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance, and self-synchronization are important factors affecting the performance of BIT1 in large-scale runs.
