Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations

Jeremy J. Williams, David Tskhakaya, Stefan Costea, Ivy B. Peng, Marta Garcia-Gasulla, Stefano Markidis

TL;DR

The paper tackles performance characterization of the BIT1 PIC/Monte Carlo code for plasma simulations on HPC systems. It uses a comprehensive profiling/tracing workflow to identify memory-bound behavior, with the arrj sorting function as the dominant bottleneck, and demonstrates strong scaling up to thousands of cores alongside notable MPI load imbalance and I/O bottlenecks. The study shows that optimizing data layout, memory hierarchy utilization, and parallel I/O are essential for performance gains, while porting BIT1 to GPUs would require substantial algorithmic reformulation to increase computational intensity. The findings provide practical guidance for optimizing large-scale PIC/MC codes and highlight strategies such as high-bandwidth memory usage, MPI de-synchronization, and in-situ analysis to improve throughput in fusion-relevant plasma simulations.

Abstract

Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of Monte Carlo collision models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases using several HPC profiling tools, such as perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance of the BIT1 sorting function is the main performance bottleneck. Strong scaling tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance, and self-synchronization are important factors impacting the performance of BIT1 in large-scale runs.
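
For readers unfamiliar with strong-scaling figures such as the 77% and 96% quoted above, the following is a minimal sketch of how a parallel efficiency is commonly computed relative to a baseline run of the same problem size. It assumes the usual definition (measured speedup divided by ideal speedup); the abstract does not state which definition or baseline the authors use. The function name and all timings are hypothetical and are not taken from the paper.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Parallel efficiency of a strong-scaling run relative to a baseline.

    t_base, n_base: wall time (s) and rank count of the baseline run.
    t_n, n:         wall time (s) and rank count of the scaled run.
    Assumes the problem size is identical in both runs.
    """
    if min(t_base, t_n) <= 0 or min(n_base, n) <= 0:
        raise ValueError("times and rank counts must be positive")
    speedup = t_base / t_n        # how much faster the scaled run is
    ideal_speedup = n / n_base    # speedup under perfect scaling
    return speedup / ideal_speedup


# Hypothetical timings, NOT measurements from the paper.
eff = strong_scaling_efficiency(t_base=1000.0, n_base=128, t_n=62.5, n=2560)
print(f"parallel efficiency: {eff:.0%}")
```

An efficiency below 100% means that part of the added resources is lost to communication, load imbalance, or serial sections of the code.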

Paper Structure

This paper contains 9 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: A diagram representing the algorithm used in BIT1.
  • Figure 2: Impact of the gcc optimization flags for the Ionization and Sheath test cases.
  • Figure 3: Percentage breakdown of the BIT1 functions in which most of the execution time is spent for the Ionization and Sheath baseline cases. The arrj sorting function (in yellow) takes the most time. The profile was obtained with the gprof tool.
  • Figure 4: MPI communication pattern of BIT1 running with eight MPI processes on Dardel, obtained using Extrae/Paraver. The trace shows that MPI communication is non-blocking point-to-point and only involves neighboring processes. MPI rank 0 is the slowest; MPI ranks 1 and 7 wait for it, leading to load imbalance.
  • Figure 5: BIT1 strong scaling test execution times on the Dardel supercomputer for the Ionization (blue line) and Sheath (yellow line) cases.
  • ...and 2 more figures
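
The load imbalance described in the Figure 4 caption, where neighboring ranks wait for the slowest rank, can be quantified in several ways. The sketch below uses one common definition, the ratio of average to maximum per-rank compute time, under which 1.0 means perfect balance. This is an illustrative assumption, not necessarily the metric used in the paper, and the per-rank times are hypothetical.

```python
def load_balance(compute_times):
    """Average over maximum per-rank compute time.

    Returns 1.0 when all ranks do the same amount of work; lower values
    mean faster ranks spend more time waiting for the slowest one.
    """
    if not compute_times:
        raise ValueError("need at least one rank")
    average = sum(compute_times) / len(compute_times)
    return average / max(compute_times)


# Hypothetical compute times (s) for eight ranks, with rank 0 the slowest.
# These values are NOT taken from the paper's trace.
times = [10.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0]
lb = load_balance(times)
print(f"load balance: {lb:.3f}")
```

In practice the per-rank compute times would come from a trace or profile, for example by subtracting the time spent inside MPI calls from each rank's wall time.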