Parallelize Over Data Particle Advection: Participation, Ping Pong Particles, and Overhead

Zhe Wang, Kenneth Moreland, Matthew Larsen, James Kress, Hank Childs, David Pugmire

TL;DR

The paper tackles the scalability challenges of parallel particle advection using Parallelize over Data (POD) in distributed memory systems. It develops a dual-perspective execution-time model with $T_r$ (rank-level) and $T_p$ (particle-level) formulations and introduces rank participation and aggregated rank participation metrics to quantify workload imbalance. A particle-centric analysis reveals that overheads from particle movements between blocks and a ping pong effect across MPI ranks are major contributors to poor performance, particularly when flow features span multiple blocks. The findings highlight the need for load-balancing strategies such as block duplication/merging and early termination to mitigate long-running particles, with practical implications for in situ visualization and HPC particle tracing workflows.

Abstract

Particle advection is one of the foundational algorithms for visualization and analysis and is central to understanding vector fields common to scientific simulations. Achieving efficient performance with large data in a distributed memory setting is notoriously difficult. Because of its simplicity and minimized movement of large vector field data, the Parallelize over Data (POD) algorithm has become a de facto standard. Despite its simplicity and ubiquitous usage, the scaling issues with the POD algorithm are known and have been described throughout the literature. In this paper, we describe a set of in-depth analyses of the POD algorithm that shed new light on the underlying causes for the poor performance of this algorithm. We designed a series of representative workloads to study the performance of the POD algorithm and executed them on a supercomputer while collecting timing and statistical data for analysis. We then performed two different types of analysis. In the first analysis, we introduce two novel metrics for measuring algorithmic efficiency over the course of a workload run. The second analysis was from the perspective of the particles being advected. Using particle-centric analysis, we identify that the overheads associated with particle movement between processes (not the communication itself) have a dramatic impact on the overall execution time. These overheads become particularly costly when flow features span multiple blocks, resulting in repeated particle circulation between blocks.

Paper Structure

This paper contains 26 sections, 4 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Images of streamlines generated from the five data sets used in this study.
  • Figure 2: POD particle advection as parallelism increases, organized by dataset. The top row shows the execution time of the particle advection algorithm, while the bottom row shows weak scalability. The x-axis for all sub-figures is the number of ranks ($\log_2$ scale). The y-axis for the top row is execution time ($\log_{10}$ scale) and for the bottom row is efficiency relative to the $8$-rank case.
  • Figure 3: Gantt charts for experiments consisting of 128 ranks and 2000 advection steps. Each chart shows the activity for each rank over the course of the run: blue regions denote advection time, white regions represent both communication time and wait time, and pink regions represent other overheads.
  • Figure 4: Rank participation values of all evaluated datasets based on experiment results from Figure 3. The $x$ axis represents the execution time of the workload, and the $y$ axis represents the corresponding rank participation value at each moment.
  • Figure 5: Aggregated participation for all experiments. The five subplots correspond to the five datasets, and each colored line within a subplot corresponds to a different number of ranks. Note that the x-axis shows how behavior changes as the number of advection steps increases: the tick marks at 1000 correspond to the behavior across an entire experiment in which particles travel 1000 steps, while the tick marks at 2000 correspond to separate experiments in which particles travel 2000 steps.
  • ...and 7 more figures
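
The captions for Figures 3 to 5 describe participation values derived from per-rank activity timelines. The summary does not give the paper's formal definitions, so the following is only a minimal sketch under two assumptions: rank participation at time $t$ is taken to be the fraction of ranks that are advecting at $t$, and aggregated participation is taken to be its time average over the run. The function names and the `busy` intervals are hypothetical illustration data, not measurements from the paper.

```python
from typing import List, Tuple

# (start, end) of one advection burst on a rank, in seconds (hypothetical data)
Interval = Tuple[float, float]


def rank_participation(busy: List[List[Interval]], t: float) -> float:
    """Fraction of ranks advecting at time t (assumed definition)."""
    active = sum(
        any(start <= t < end for start, end in intervals)
        for intervals in busy
    )
    return active / len(busy)


def aggregated_participation(busy: List[List[Interval]], total_time: float) -> float:
    """Time average of rank participation over [0, total_time] (assumed definition).

    With non-overlapping intervals per rank, this equals the total advection
    time summed over ranks, divided by (number of ranks * total_time).
    """
    advect_time = sum(end - start for intervals in busy for start, end in intervals)
    return advect_time / (len(busy) * total_time)


# Four hypothetical ranks over a 10-second run: one rank stays busy with
# long-running particles while the others go idle early.
busy = [
    [(0.0, 10.0)],
    [(0.0, 4.0)],
    [(0.0, 2.0), (6.0, 8.0)],
    [(0.0, 1.0)],
]

print(rank_participation(busy, 3.0))
print(aggregated_participation(busy, 10.0))
```

In this toy timeline, participation starts at 1.0 and decays as ranks run out of work, which is the kind of imbalance the participation plots are meant to expose.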