Optimal Broadcast Schedules in Logarithmic Time with Applications to Broadcast, All-Broadcast, Reduction and All-Reduction

Jesper Larsson Träff

TL;DR

The paper presents algorithms that compute round-optimal broadcast schedules in $O(\log p)$ time per processor for broadcasting $n$ blocks among $p$ processors. The resulting broadcast meets the lower bound of $n-1+\lceil \log_2 p\rceil$ communication rounds on a directed, regular circulant communication graph of degree $q=\lceil \log_2 p\rceil$. Each processor $r$ computes its receive schedule $\text{recvblock}[k]_r$ and its send schedule $\text{sendblock}[k]_r$, $0\leq k<q$, in $O(q)=O(\log p)$ time, independently and without inter-processor communication. The same schedules yield implementations of the MPI collectives MPI_Bcast, MPI_Allgatherv, MPI_Reduce and MPI_Reduce_scatter, and answer open questions on round-optimality posed in earlier SPAA 2022 and CLUSTER 2022/2023 papers. Preliminary experiments against native MPI libraries show promising results and motivate further optimization for hierarchical architectures.

Abstract

We give optimally fast $O(\log p)$ time (per processor) algorithms for computing round-optimal broadcast schedules for message-passing parallel computing systems. This affirmatively answers difficult questions posed in a SPAA 2022 BA and a CLUSTER 2022 paper. We observe that the computed schedules and circulant communication graph can likewise be used for reduction, all-broadcast and all-reduction, leading to new, round-optimal algorithms for these problems. These observations affirmatively answer open questions posed in a CLUSTER 2023 paper. The problem is to broadcast $n$ indivisible blocks of data from a given root processor to all other processors in a (subgraph of a) fully connected network of $p$ processors with fully bidirectional, one-ported communication capabilities. In this model, $n-1+\lceil\log_2 p\rceil$ communication rounds are required. Our new algorithms compute for each processor in the network receive and send schedules, each of size $\lceil\log_2 p\rceil$, that determine uniquely in $O(1)$ time for each communication round the new block that the processor will receive and the already received block it has to send. Schedule computations are done independently per processor without communication. The broadcast communication subgraph is an easily computable, directed, $\lceil\log_2 p\rceil$-regular circulant graph also used elsewhere. We show how the schedule computations can be done in optimal time and space of $O(\log p)$, improving significantly over previous results of $O(p\log^2 p)$ and $O(\log^3 p)$, respectively. The schedule computation and broadcast algorithms are simple to implement, but correctness and complexity are not obvious. The schedules are used for new implementations of the MPI (Message-Passing Interface) collectives MPI_Bcast, MPI_Allgatherv, MPI_Reduce and MPI_Reduce_scatter. Preliminary experimental results are given.
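To make the round bound from the abstract concrete, the following minimal Python sketch evaluates $q=\lceil\log_2 p\rceil$ and the required number of rounds $n-1+q$. The function names `ceil_log2` and `optimal_rounds` are illustrative and not taken from the paper; the sketch only evaluates the formula and does not compute any schedule.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # (p - 1).bit_length() equals ceil(log2(p)) for every integer p >= 1.
    return (p - 1).bit_length()


def optimal_rounds(n: int, p: int) -> int:
    """Number of rounds n - 1 + ceil(log2 p) stated in the abstract
    for broadcasting n blocks among p processors."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - 1 + ceil_log2(p)


# Example: 10 blocks among 17 processors gives q = 5 and 10 - 1 + 5 = 14 rounds.
q = ceil_log2(17)
rounds = optimal_rounds(10, 17)
```

With a single block ($n=1$) the bound reduces to $\lceil\log_2 p\rceil$ rounds, and each additional block adds exactly one round.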

Paper Structure (9 sections, 10 theorems, 4 equations, 2 figures, 4 tables, 7 algorithms)

Introduction
Broadcast(ing with) Schedules and Circulant Graphs
The communication pattern
The receive schedule
The send schedule
Empirical Results
Summary
All-Broadcast
Overheads

Key Result

Theorem 1

Let $K$, $K>0$, be a number of communication phases, each consisting of $q$ communication rounds, for a total of $Kq$ rounds. Assume that in each round $i$, $0\leq i<Kq$, each processor $r$, $0\leq r<p$, receives a block $\mathtt{recvblock}[i\bmod q]_r+\lfloor i/q\rfloor q$ and sends a block $\mathtt{s
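The receive-block indexing in the visible part of the theorem statement can be sketched in Python as follows. The function name `received_block` and the schedule values are hypothetical placeholders chosen only to show the arithmetic; they are not a schedule computed by the paper's algorithms.

```python
from typing import Sequence


def received_block(recvblock: Sequence[int], i: int) -> int:
    """Block index received in round i, following the expression
    recvblock[i mod q] + floor(i / q) * q from the theorem statement,
    where q is the length of the per-processor receive schedule."""
    q = len(recvblock)
    if q == 0:
        raise ValueError("receive schedule must not be empty")
    if i < 0:
        raise ValueError("round index must be non-negative")
    phase = i // q  # which of the K phases round i belongs to
    return recvblock[i % q] + phase * q


# Hypothetical schedule of length q = 3 for a single processor (placeholder values).
schedule = [0, 1, 2]
blocks = [received_block(schedule, i) for i in range(6)]  # rounds of K = 2 phases
```

The schedule itself has only $q$ entries; the phase offset $\lfloor i/q\rfloor q$ shifts it by $q$ blocks per phase, which is why one $O(1)$ lookup per round suffices.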

Figures (2)

Figure 1: Broadcast and reduce results, native versus new, with the OpenMPI 4.1.5 library with $p=200\times 1, p=200\times 4$ and $p=200\times 128$ MPI processes. The constant factor $F$ for the size of the blocks has been chosen as $F=70$. The MPI datatype is MPI_INT.
Figure 2: Irregular all-broadcast (MPI_Allgatherv) results, native versus new, with the OpenMPI 4.0.5 library with $p=36\times 32$ MPI processes and different types of input problems (regular, irregular, degenerate). The constant factor $G$ for the number of blocks has been chosen as $G=40$. The MPI datatype is MPI_INT.

Theorems & Definitions (19)

Theorem 1
proof
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
...and 9 more
