Parallelize Over Data Particle Advection: Participation, Ping Pong Particles, and Overhead
Zhe Wang, Kenneth Moreland, Matthew Larsen, James Kress, Hank Childs, David Pugmire
TL;DR
The paper tackles the scalability challenges of parallel particle advection using Parallelize over Data (POD) in distributed memory systems. It develops a dual-perspective execution-time model with $T_r$ (rank-level) and $T_p$ (particle-level) formulations and introduces rank participation and aggregated rank participation metrics to quantify workload imbalance. A particle-centric analysis reveals that overheads from particle movements between blocks and a ping pong effect across MPI ranks are major contributors to poor performance, particularly when flow features span multiple blocks. The findings highlight the need for load-balancing strategies such as block duplication/merging and early termination to mitigate long-running particles, with practical implications for in situ visualization and HPC particle tracing workflows.
Abstract
Particle advection is one of the foundational algorithms for visualization and analysis and is central to understanding vector fields common to scientific simulations. Achieving efficient performance with large data in a distributed memory setting is notoriously difficult. Because of its simplicity and minimized movement of large vector field data, the Parallelize over Data (POD) algorithm has become a de facto standard. Despite the algorithm's simplicity and ubiquitous usage, its scaling issues are known and have been described throughout the literature. In this paper, we describe a set of in-depth analyses of the POD algorithm that shed new light on the underlying causes for the poor performance of this algorithm. We designed a series of representative workloads to study the performance of the POD algorithm and executed them on a supercomputer while collecting timing and statistical data for analysis. We then performed two different types of analysis. In the first analysis, we introduce two novel metrics for measuring algorithmic efficiency over the course of a workload run. The second analysis was from the perspective of the particles being advected. Using particle-centric analysis, we identify that the overheads associated with particle movement between processes (not the communication itself) have a dramatic impact on the overall execution time. These overheads become particularly costly when flow features span multiple blocks, resulting in repeated particle circulation between blocks.
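To make the last point concrete, the following is a toy sketch (not the paper's code or workloads) of how a flow feature that spans several blocks forces a particle to change owners repeatedly. Everything in it is an assumption made for illustration: a unit-square domain split into 2x2 blocks with one block per rank, a rigid-rotation vortex centered on the shared block corner, a fixed-step RK4 integrator, and the hypothetical helper names `block_of`, `velocity`, `rk4_step`, and `count_handoffs`.

```python
def block_of(x, y, blocks_per_axis=2):
    """Return the id of the block (assumed: one block per rank) owning (x, y)."""
    i = min(int(x * blocks_per_axis), blocks_per_axis - 1)
    j = min(int(y * blocks_per_axis), blocks_per_axis - 1)
    return j * blocks_per_axis + i


def velocity(x, y, cx=0.5, cy=0.5):
    """Rigid rotation about (cx, cy): a vortex sitting on the block corner."""
    return -(y - cy), (x - cx)


def rk4_step(x, y, h):
    """Advance the particle by one classical fourth-order Runge-Kutta step."""
    k1x, k1y = velocity(x, y)
    k2x, k2y = velocity(x + 0.5 * h * k1x, y + 0.5 * h * k1y)
    k3x, k3y = velocity(x + 0.5 * h * k2x, y + 0.5 * h * k2y)
    k4x, k4y = velocity(x + h * k3x, y + h * k3y)
    x_new = x + h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
    y_new = y + h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return x_new, y_new


def count_handoffs(x, y, steps, h=0.01):
    """Count how often the particle's owning block changes while it is advected."""
    owner = block_of(x, y)
    handoffs = 0
    for _ in range(steps):
        x, y = rk4_step(x, y, h)
        new_owner = block_of(x, y)
        if new_owner != owner:
            # In a POD run, this is where the particle would be sent to another rank.
            handoffs += 1
            owner = new_owner
    return handoffs


if __name__ == "__main__":
    # A particle orbiting the block corner changes owner on every quarter turn.
    print(count_handoffs(0.7, 0.55, steps=2000))
```

In this toy setting the particle crosses a block boundary on every quarter revolution, so the number of handoffs grows with the number of advection steps rather than with the distance the particle travels away from its seed. Each handoff stands in for the per-movement overhead that the particle-centric analysis identifies as costly.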
