On the performance of two-sided MPI, MPI-3 RMA and SHMEM in a Lagrangian particle cluster algorithm

Matthias Frey; Douglas Shanks; Steven Böing; Rui F. G. Apóstolo

TL;DR

This work addresses merging small Lagrangian parcels in an $N$-body context by performing a $k=1$ nearest-neighbour search in $3$-D space and two-stage directed graph pruning. It systematically compares three distributed-memory models—MPI two-sided, MPI-3 RMA, and SHMEM—within the EPIC-based parcel clustering framework. The study finds that MPI P2P generally provides robust, scalable performance across interconnects, while SHMEM's performance is highly network-dependent and MPI-3 RMA's scalability depends on workload balance; results are sensitive to 4-byte data transfers and interconnect bandwidth/latency. The authors discuss the limitations of epoch-based RMA usage and suggest avenues for optimization, such as single-epoch strategies per pruning stage and alternative communication approaches, with implications for large-scale geophysical fluid simulations.

Abstract

In this paper, we compare the parallel performance of three distributed-memory communication models for a cluster algorithm based on a nearest neighbour search algorithm for N-body simulations. The nearest neighbour is defined by the Euclidean distance in three-dimensional space. The resulting directed nearest neighbour graphs that are used to define the clusters are pruned in an iterative procedure where we use either point-to-point message passing interface (MPI), MPI-3 remote memory access (RMA), or SHMEM communication. The original algorithm has been developed and implemented as part of the elliptical parcel-in-cell (EPIC) method targeting geophysical fluid flows. The parallel scalability of the algorithm is discussed by means of an artificial and a standard fluid dynamics test case. Performance measurements were carried out on three different computing systems with InfiniBand FDR, Hewlett Packard Enterprise (HPE) Slingshot 10 or HPE Slingshot 200 interconnect.
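As an illustration of the graph construction described above, here is a minimal serial sketch of a k = 1 nearest-neighbour search using the Euclidean distance in three dimensions. It is not the EPIC implementation: the function name `nearest_neighbour_graph`, the brute-force O(N²) search, and the sample coordinates are assumptions made for this example only, and no domain decomposition or communication is modelled.

```python
import math


def nearest_neighbour_graph(points):
    """Return a list nn where nn[i] is the index of the point closest to points[i].

    Hypothetical helper for illustration: a brute-force k = 1 search over
    3-D points using the Euclidean distance. Each entry nn[i] can be read
    as a directed edge i -> nn[i].
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbour
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn


# Made-up sample coordinates, all on the x-axis for easy checking by hand.
points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.8, 0.0, 0.0),
    (5.0, 0.0, 0.0),
    (5.5, 0.0, 0.0),
]
edges = nearest_neighbour_graph(points)
print(edges)
```

In this sample, points 1 and 2 point at each other, as do points 3 and 4, while point 0 points at point 1 without being anyone's nearest neighbour.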

Paper Structure (12 sections, 21 figures, 5 tables, 1 algorithm)

Introduction
Nearest neighbour cluster algorithm
Directed graph (DG) pruning algorithm
Graph pruning algorithm with MPI-3 RMA communication
Graph pruning algorithm with MPI point-to-point communication
Graph pruning algorithm with SHMEM communication
Parallel performance analysis
Latency and bandwidth tests
Parcel cluster algorithm
Example: Artificial parcel configuration
Example: Rayleigh-Taylor instability
Conclusions

Figures (21)

Figure 1: An example of an unweighted directed graph (DG). We call all nodes without an incoming edge leaf vertices. This DG has leaf vertices A, B, G, J, K and L.
Figure 2: Directed graph pruning step for the graph illustrated in Figure 1. The algorithm consists of two stages. In the first stage, an iterative procedure performs two iterations illustrated in (a) and (b). The second stage eliminates all dual links as shown in (c). After these stages we are left with four smaller subgraphs.
Figure 3: Incomplete Fortran sample code to demonstrate RMA operations with either active or passive target communication on an MPI window win. Note: Instead of MPI_Put, a call to MPI_Get is also possible.
Figure 4: Illustration of an RMA epoch with MPI put operations using either the separate or unified memory model for the remote memory access communication. In the unified model there is no distinction between public and private copy. In the separate memory model synchronisation calls (i.e. MPI_Win_sync) ensure memory coherence between the private and public copy. Such calls are denoted by the arrows labelled 'sync'. Note that the result of sync is not reflected by the figure.
Figure 5: Emulating remote memory access with point-to-point communication. Left: process A and process B write to their local and buffer memory. When an epoch ends, a synchronisation call (symbolised with the arrows labelled 'sync') updates the buffer memory of process A with the local memory of process B, and vice-versa. Right: During an epoch of get operations, all accesses are performed locally, i.e. no communication is required.
...and 16 more figures
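
The leaf-vertex definition in the caption of Figure 1 (a node without an incoming edge) can be sketched in a few lines. The function name `leaf_vertices` and the edge list below are hypothetical; the edges are not those of the graph drawn in Figure 1, which is not reproduced here.

```python
def leaf_vertices(edges):
    """Return the sorted list of nodes that have no incoming edge.

    `edges` is a list of (source, target) pairs of a directed graph.
    Hypothetical helper, for illustration only.
    """
    nodes = set()
    targets = set()
    for source, target in edges:
        nodes.update((source, target))
        targets.add(target)  # target has at least one incoming edge
    return sorted(nodes - targets)


# Made-up example: A and B point to C; C and D point to each other.
example_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "C")]
leaves = leaf_vertices(example_edges)
print(leaves)
```

Here A and B are leaf vertices because no edge ends at them, whereas C and D each receive at least one edge.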
