Configurable Non-uniform All-to-all Algorithms

Ke Fan, Jens Domke, Seydou Ba, Sidharth Kumar

TL;DR

A set of Tunable Non-uniform All-to-all algorithms, denoted TuNA{l}{g}, where g and l refer to the global (inter-node) and local (intra-node) communication hierarchies, that efficiently address the trade-off between bandwidth maximization and latency minimization, which existing implementations struggle to optimize.

Abstract

MPI_Alltoallv generalizes the uniform all-to-all communication (MPI_Alltoall) by enabling the exchange of data blocks of varied sizes among processes. This function plays a crucial role in many applications, such as FFT computation and relational algebra operations. Popular MPI libraries, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of linear and logarithmic algorithms. However, MPI_Alltoallv typically relies only on variations of linear algorithms, missing the benefits of logarithmic approaches. Furthermore, current algorithms overlook the intricacies of modern HPC system architectures, such as the significant performance gap between intra-node (local) and inter-node (global) communication. This paper introduces a set of Tunable Non-uniform All-to-all algorithms, denoted TuNA{l}{g}, where g and l refer to the global (inter-node) and local (intra-node) communication hierarchies. These algorithms consider key factors such as the hierarchical architecture of HPC systems, network congestion, the number of data exchange rounds, and the communication burst size. They efficiently address the trade-off between bandwidth maximization and latency minimization that existing implementations struggle to optimize. We show a performance improvement over the state-of-the-art implementations by factors of 42x and 138x on Polaris and Fugaku, respectively.
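To make the linear-versus-logarithmic contrast concrete, here is a minimal pure-Python model (no MPI involved) of a non-uniform all-to-all carried out with a simple linear schedule, together with a round count for a generic radix-r store-and-forward scheme. The names `linear_alltoallv` and `log_rounds` are hypothetical, and the round-count bound describes a textbook Bruck-style scheme rather than TuNA itself.

```python
def linear_alltoallv(send):
    """Simulate a linear non-uniform all-to-all (illustrative model only).

    send[p][d] is the variable-size payload that process p sends to d.
    Returns (recv, rounds) with recv[d][p] == send[p][d].
    """
    P = len(send)
    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        recv[p][p] = send[p][p]          # local copy, no communication
    rounds = 0
    for k in range(1, P):                # one partner per round
        for p in range(P):
            dst = (p + k) % P
            recv[dst][p] = send[p][dst]  # each block travels exactly once
        rounds += 1
    return recv, rounds


def log_rounds(P, r=2):
    """Upper bound on rounds for a generic radix-r store-and-forward scheme:
    (r - 1) rounds per base-r digit needed to write the indices 0..P-1."""
    digits, span = 0, 1
    while span < P:
        span *= r
        digits += 1
    return digits * (r - 1)


# Blocks of different sizes: process p sends (p + d + 1) bytes to process d.
P = 8
send = [[bytes([p]) * (p + d + 1) for d in range(P)] for p in range(P)]
recv, linear = linear_alltoallv(send)
print(linear, log_rounds(P, 2))
```

For 8 processes the linear schedule needs 7 rounds, while the radix-2 bound is 3. The price of fewer rounds is that blocks are forwarded through intermediate processes, so more bytes cross the network in total; this is the bandwidth-versus-latency trade-off the abstract refers to.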

Paper Structure

This paper contains 21 sections, 16 figures, and 3 algorithms.

Figures (16)

  • Figure 1: Interplay of proposed parameterized algorithms with existing foundational approaches.
  • Figure 2: Example of $\text{TuNA}$ with $P = 4$ and $r = 2$. (A) is the initial state. $S$ consists of $4$ data-blocks of different sizes, shown in different colors. (B) shows the rotated data-block indices and their matching binary representation for $P\mathit{1}$. (C) and (D) illustrate two communication rounds for $P\mathit{1}$. Each round employs a two-phase communication scheme: (1) metadata exchange and (2) actual data exchange.
  • Figure 3: Examples of memory optimization with three configurations, each showing a single process and the data blocks exchanged per communication round. In each round, green blocks reach their destination, while blue blocks are temporarily stored in $T$ for transfer in later rounds. Meanwhile, green blocks with red boxes are sent only once during the entire communication, allowing their space in $T$ to be omitted.
  • Figure 4: Two intra-node strategies: (a) explicit and (b) implicit (ours). Assuming data blocks on each node are logically divided into $N=3$ groups, each process within a node has $Q=3$ data-blocks per group. An explicit strategy performs all-to-all only within the group whose index matches the node's ID. Our approach performs all-to-all within each group.
  • Figure 5: An example of $\text{TuNA}_\text{\scriptsize l}^\text{\scriptsize g}$ when $P = 15$, $N = 3$, $r = 2$, and $Q = 5$. (1) depicts the initial state in $S$ for all processes within node $N\mathit{0}$. Each process logically has $Q$ data-blocks in each of the $N$ groups (separated by red boxes). Marker 0 shows the rotation index array for $P\mathit{0}$ based on the group rank ID ($g = p\ \%\ Q$). Markers 1, 2, and 3 illustrate the three intra-node communication steps, where the sent data-blocks are colored. (2) presents the data status in $R$ and $T$ after the intra-node communication, where $R$ holds the data-blocks destined for the processes within $N\mathit{0}$. Markers 4 and 5 depict two communication steps for inter-node communication. Markers a and b are two communication patterns (matching \ref{fig:inter_strag}). Finally, each process receives the required data-blocks in $R$.
  • ...and 11 more figures
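
The Figure 2 caption mentions rotated data-block indices, their radix-$r$ representation, and a two-phase exchange (metadata, then data) in each round. The sketch below models those ingredients in plain Python as a generic radix-r Bruck-style store-and-forward exchange. It is an illustration under stated assumptions, not the paper's implementation: the function name `bruck_style_alltoallv`, the slot layout, and the packing format are hypothetical, and no MPI calls are made.

```python
def bruck_style_alltoallv(send, r=2):
    """Simulate a radix-r store-and-forward all-to-all with variable sizes.

    send[p][d] is the bytes payload from process p to process d.
    Returns (recv, rounds) with recv[d][p] == send[p][d].
    """
    P = len(send)
    # Rotation: slot i of process p holds the block destined for (p + i) % P.
    slots = [[send[p][(p + i) % P] for i in range(P)] for p in range(P)]
    rounds = 0
    step = 1                                    # r**x for digit position x
    while step < P:
        for z in range(1, r):                   # non-zero digit values
            if z * step >= P:
                break                           # no slot index has this digit
            # Slots whose base-r digit at position x equals z move this round.
            idx = [i for i in range(P) if (i // step) % r == z]
            # Phase 1 (metadata): block sizes. Phase 2 (data): packed bytes.
            meta = [[len(slots[p][i]) for i in idx] for p in range(P)]
            data = [b"".join(slots[p][i] for i in idx) for p in range(P)]
            for p in range(P):
                dst = (p + z * step) % P
                off = 0
                for i, n in zip(idx, meta[p]):  # receiver unpacks by size
                    slots[dst][i] = data[p][off:off + n]
                    off += n
            rounds += 1
        step *= r
    # Slot i of process p now holds the block that originated at (p - i) % P.
    recv = [[slots[p][(p - s) % P] for s in range(P)] for p in range(P)]
    return recv, rounds


# Same setting as the caption: P = 4 processes, radix r = 2.
P = 4
send = [[bytes([16 * p + d]) * (1 + (p + 2 * d) % 3) for d in range(P)]
        for p in range(P)]
recv, rounds = bruck_style_alltoallv(send, r=2)
print(rounds)
```

With $P = 4$ and $r = 2$ the exchange finishes in two rounds, matching the two communication rounds the caption describes. The metadata phase is needed because a receiver cannot know the sizes of forwarded blocks in advance when block sizes are non-uniform.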