Optimal, Non-pipelined Reduce-scatter and Allreduce Algorithms

Jesper Larsson Träff

TL;DR

The paper addresses partitioned reduction and replication of vectors across $p$ processors, that is, the reduce-scatter and allreduce collectives. It introduces a uniform, non-pipelined reduce-scatter algorithm that runs in $\lceil \log_2 p\rceil$ rounds on a circulant graph with halving skips. Each processor sends, receives and reduces exactly $p-1$ blocks, and processor $r$ ends with the result block $W = \bigoplus_{i=0}^{p-1} V_i[r]$, where $V_i[r]$ is block $r$ of the input vector of processor $i$. An allreduce algorithm is obtained by composing reduce-scatter with an allgather that uses a stack to replay the skip sequence in reverse order and so preserve block order, resulting in $2\lceil \log_2 p\rceil$ rounds and $2(p-1)$ blocks per processor. The methods map directly to the MPI collectives MPI_Reduce_scatter_block, MPI_Reduce_scatter and MPI_Allreduce, and offer a simple framework that can extend to all-to-all and related primitives. The reduce-scatter algorithm is round- and volume-optimal, the allreduce algorithm is volume-optimal, and both are easy to implement in MPI-based high-performance computing.

Abstract

The reduce-scatter collective operation, in which $p$ processors in a network of processors collectively reduce $p$ input vectors into a result vector that is partitioned over the processors, is important both in its own right and as a building block for other collective operations. We present a surprisingly simple, but non-trivial algorithm for solving this problem optimally in $\lceil\log_2 p\rceil$ communication rounds with each processor sending, receiving and reducing exactly $p-1$ blocks of vector elements. We combine this with a similarly simple, well-known allgather algorithm to get a volume-optimal algorithm for the allreduce collective operation, where the result vector is replicated on all processors. The communication pattern is a simple, $\lceil\log_2 p\rceil$-regular, circulant graph also used elsewhere. The algorithms assume the binary reduction operator to be commutative, and we discuss this assumption. The algorithms can readily be implemented and used for the collective operations MPI_Reduce_scatter_block, MPI_Reduce_scatter and MPI_Allreduce as specified in the MPI standard. We also observe that the reduce-scatter algorithm can be used as a template for round-optimal all-to-all communication and the collective MPI_Alltoall operation.
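The communication pattern described in the abstract can be illustrated with a small single-process simulation. This is only a sketch reconstructed from the description above, not the paper's own listing: the function names (`reduce_scatter`, `allgather`, `allreduce`), the rotated buffer `R`, and the choice to simulate all $p$ processors in one Python process instead of calling MPI are assumptions made for illustration. It assumes that in each round the skip is halved to $\lceil s/2\rceil$, that processor $r$ sends to $(r+\lceil s/2\rceil) \bmod p$ and receives from $(r-\lceil s/2\rceil) \bmod p$, and that the reduction operator is commutative.

```python
import operator

def reduce_scatter(V, op):
    """Simulate reduce-scatter on p processors; V[r][b] is block b of processor r's input.

    Returns (W, skips, sent): W[r] is the result block left on processor r,
    skips is the sequence of skips used, sent is the number of blocks each processor sent.
    """
    p = len(V)
    # Rotate so that R[r][i] is processor r's partial result for block (r + i) mod p.
    R = [[V[r][(r + i) % p] for i in range(p)] for r in range(p)]
    s, skips, sent = p, [], 0
    while s > 1:
        h = (s + 1) // 2  # halved skip, ceil(s / 2)
        # Processor r sends R[r][h:s] to (r + h) mod p, hence receives from (r - h) mod p.
        # All messages are copied first so the round behaves like a simultaneous send-receive.
        T = [R[(r - h) % p][h:s] for r in range(p)]
        for r in range(p):
            for i, t in enumerate(T[r]):
                R[r][i] = op(R[r][i], t)  # relies on op being commutative
            del R[r][h:]  # blocks h..s-1 have been handed on
        sent += s - h
        skips.append(h)
        s = h
    return [R[r][0] for r in range(p)], skips, sent

def allgather(W, skips):
    """Replicate the blocks W[r] on all processors by replaying the skips in reverse order."""
    p = len(W)
    R = [[W[r]] for r in range(p)]  # R[r][i] is result block (r + i) mod p
    stack, sent = list(skips), 0
    while stack:
        h = stack.pop()                # most recently used skip first
        s = stack[-1] if stack else p  # number of blocks held after this round
        # Processor r receives the first s - h blocks held by processor (r + h) mod p.
        T = [R[(r + h) % p][:s - h] for r in range(p)]
        for r in range(p):
            R[r].extend(T[r])
        sent += s - h
    # Undo the rotation so that every processor holds the blocks in order 0..p-1.
    return [[R[r][(b - r) % p] for b in range(p)] for r in range(p)], sent

def allreduce(V, op):
    """Reduce-scatter followed by allgather; returns (result per processor, rounds, blocks sent)."""
    W, skips, sent_rs = reduce_scatter(V, op)
    full, sent_ag = allgather(W, skips)
    return full, 2 * len(skips), sent_rs + sent_ag

if __name__ == "__main__":
    p = 5
    V = [[10 * r + b for b in range(p)] for r in range(p)]  # toy one-element blocks
    W, skips, sent = reduce_scatter(V, operator.add)
    print(skips, sent)  # [3, 2, 1] 4
    print(W)            # [100, 105, 110, 115, 120]
    full, rounds, blocks = allreduce(V, operator.add)
    print(rounds, blocks, full[0] == W)  # 6 8 True
```

In this sketch each contribution moves forward by the current skip exactly when its rotated index is at least that skip, so it arrives at the owner of its block after the last round. A real implementation would replace the list copies with paired send-receive calls on contiguous block ranges.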

Paper Structure

This paper contains 8 sections, 5 theorems, 4 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

On $p$ input vectors partitioned into $p$ blocks, Algorithm alg:blockreduction solves the reduce-scatter (partitioned all-reduce) problem in $\lceil \log_2 p\rceil$ send-receive communication rounds. Each processor sends and receives exactly $p-1$ partial result blocks of elements and performs exact
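The round and block counts stated in Theorem 1 can be checked with a short calculation. The notation $s_k$ and $q$ is introduced here for illustration and is not taken from the paper. The calculation assumes that the number of blocks a processor still holds is halved (rounded up) in every round, and that the blocks it gives up in a round are exactly the ones it sends.

$$
\begin{aligned}
s_0 &= p, \qquad s_{k+1} = \lceil s_k/2 \rceil \;\Rightarrow\; s_k = \lceil p/2^k \rceil, \\
s_q &= 1 \iff 2^q \ge p, \quad \text{so the smallest such } q \text{ is } \lceil \log_2 p \rceil, \\
\sum_{k=0}^{q-1} \left(s_k - s_{k+1}\right) &= s_0 - s_q = p - 1 .
\end{aligned}
$$

Under these assumptions the sum telescopes, so the total of $p-1$ blocks does not depend on whether $p$ is a power of two.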

Figures (1)

  • Figure 1: The tree implicitly constructed by each processor by Algorithm alg:blockreduction for $p=22$.

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Theorem 2