On the performance of two-sided MPI, MPI-3 RMA and SHMEM in a Lagrangian particle cluster algorithm
Matthias Frey, Douglas Shanks, Steven Böing, Rui F. G. Apóstolo
TL;DR
This work addresses the merging of small Lagrangian parcels in an $N$-body context by performing a $k=1$ nearest-neighbour search in three-dimensional space, followed by a two-stage pruning of the resulting directed graph. It systematically compares three distributed-memory communication models (two-sided MPI, MPI-3 RMA, and SHMEM) within the EPIC-based parcel clustering framework. The study finds that MPI point-to-point (P2P) communication generally provides robust, scalable performance across interconnects, whereas SHMEM's performance is highly network-dependent and MPI-3 RMA's scalability depends on workload balance. Results are sensitive to how 4-byte data transfers are handled and to interconnect bandwidth and latency. The authors discuss the limitations of epoch-based RMA usage and suggest avenues for optimization, such as using a single epoch per pruning stage and alternative communication approaches, with implications for large-scale geophysical fluid simulations.
Abstract
In this paper, we compare the parallel performance of three distributed-memory communication models for a cluster algorithm based on a nearest neighbour search in N-body simulations. The nearest neighbour is defined by the Euclidean distance in three-dimensional space. The resulting directed nearest neighbour graphs, which are used to define the clusters, are pruned in an iterative procedure in which we use either point-to-point message passing interface (MPI), MPI-3 remote memory access (RMA), or SHMEM communication. The original algorithm was developed and implemented as part of the elliptical parcel-in-cell (EPIC) method targeting geophysical fluid flows. The parallel scalability of the algorithm is discussed by means of an artificial test case and a standard fluid dynamics test case. Performance measurements were carried out on three different computing systems with InfiniBand FDR, Hewlett Packard Enterprise (HPE) Slingshot 10 or HPE Slingshot 200 interconnect.
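
To make the notion of a directed nearest neighbour graph concrete, the following is a minimal serial sketch in Python. It is not the EPIC implementation and does not reproduce the paper's two-stage pruning or any of its communication. It only builds the $k=1$ graph by brute force using the Euclidean distance and lists the mutual nearest-neighbour pairs, which are the simplest candidates for a cluster. The function names `nearest_neighbour_graph` and `mutual_pairs` and the sample coordinates are hypothetical and chosen purely for illustration.

```python
import math


def nearest_neighbour_graph(points):
    """Return a list nn where nn[i] is the index of the point closest to points[i].

    Brute-force O(n^2) search using the Euclidean distance in 3-D.
    Each entry nn[i] represents one directed edge i -> nn[i].
    """
    n = len(points)
    nn = []
    for i in range(n):
        best_j, best_d = -1, math.inf
        for j in range(n):
            if i == j:
                continue  # a point is never its own neighbour
            d = math.dist(points[i], points[j])
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn


def mutual_pairs(nn):
    """Return sorted (i, j) pairs with i < j whose edges point at each other."""
    return sorted((i, j) for i, j in enumerate(nn) if i < j and nn[j] == i)


# Hypothetical sample data: five points in 3-D space.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (1.0, 1.0, 1.0),
    (1.0, 1.0, 1.3),
    (0.5, 0.0, 0.0),
]

nn = nearest_neighbour_graph(points)
pairs = mutual_pairs(nn)
print(nn)
print(pairs)
```

In this sample, point 4 has point 1 as its nearest neighbour, but point 1 points back at point 0. Edges of this kind are what make the graph directed rather than symmetric. In a distributed setting, the two endpoints of an edge may also reside on different processes, which is where the choice of communication model becomes relevant.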
