Round-optimal $n$-Block Broadcast Schedules in Logarithmic Time

Jesper Larsson Träff

TL;DR

The paper addresses round-optimal $n$-block broadcast in fully connected, one-ported networks and shows that each processor can compute its schedules in $O(\log p)$ time and $O(\log p)$ space while the broadcast completes in the optimal $n-1+\lceil \log_2 p\rceil$ rounds. Communication follows an explicit circulant-graph pattern with $q=\lceil \log_2 p\rceil$ that serves both single-root broadcast and all-to-all broadcast, and each processor computes its receive and send schedules without inter-processor communication. The receive schedule employs a greedy DFS over canonical skip sequences to produce $q$ distinct blocks across rounds, while the send schedule leverages a power-of-two construction and a general extension to non-power-of-two cases, ensuring correctness with only a constant number of violations. Empirical results on a $36\times 32$ processor-core cluster demonstrate substantial speedups over previous algorithms and support practical MPI_Bcast and MPI_Allgatherv implementations.

Abstract

We give optimally fast $O(\log p)$ time (per processor) algorithms for computing round-optimal broadcast schedules for message-passing parallel computing systems. This affirmatively answers the questions posed in Träff (2022). The problem is to broadcast $n$ indivisible blocks of data from a given root processor to all other processors in a (subgraph of a) fully connected network of $p$ processors with fully bidirectional, one-ported communication capabilities. In this model, $n-1+\lceil\log_2 p\rceil$ communication rounds are required. Our new algorithms compute for each processor in the network receive and send schedules each of size $\lceil\log_2 p\rceil$ that determine uniquely in $O(1)$ time for each communication round the new block that the processor will receive, and the already received block it has to send. Schedule computations are done independently per processor without communication. The broadcast communication subgraph is the same, easily computable, directed, $\lceil\log_2 p\rceil$-regular circulant graph used in Träff (2022) and elsewhere. We show how the schedule computations can be done in optimal time and space of $O(\log p)$, improving significantly over previous results of $O(p\log^2 p)$ and $O(\log^3 p)$. The schedule computation and broadcast algorithms are simple to implement, but correctness and complexity are not obvious. All algorithms have been implemented, compared to previous algorithms, and briefly evaluated on a small $36\times 32$ processor-core cluster.
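As a small worked illustration of the round bound quoted in the abstract, the following sketch computes $q=\lceil\log_2 p\rceil$ and the round count $n-1+\lceil\log_2 p\rceil$. The function names `ceil_log2` and `optimal_rounds` are hypothetical helpers introduced here for illustration; they are not taken from the paper, and the sketch does not compute the schedules themselves.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # For p >= 1, the bit length of p - 1 equals ceil(log2(p)).
    return (p - 1).bit_length()


def optimal_rounds(n: int, p: int) -> int:
    """Round count n - 1 + ceil(log2 p) from the abstract (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - 1 + ceil_log2(p)


if __name__ == "__main__":
    p = 36 * 32       # 1152 processes, the largest configuration evaluated
    q = ceil_log2(p)  # 2**10 = 1024 < 1152 <= 2048 = 2**11
    print(q, optimal_rounds(1, p), optimal_rounds(100, p))  # prints "11 11 110"
```

With $p=1152$ the schedules therefore have $q=11$ entries per processor, and broadcasting $n=100$ blocks takes 110 rounds in this model.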

Paper Structure (8 sections, 8 theorems, 1 equation, 3 figures, 3 tables, 9 algorithms)

Introduction
Algorithms
Broadcast and all-to-all broadcast using schedules
The communication pattern
The receive schedule
The send schedule
Empirical Results
Summary

Key Result

Theorem 1

Let $K,K>0$ be a number of communication phases each consisting of $q$ communication rounds for a total of $Kq$ rounds. Assume that in each round $i, 0\leq i<Kq$, each processor $r,0\leq r<p$ receives a block $\mathtt{recvblock[}i\bmod q\mathtt{]}+\lfloor i/q\rfloor q$ and sends a block $\mathtt{sen

Figures (3)

Figure 1: Broadcast results, native versus new, with the OpenMPI 4.1.4 library with $p=36\times 32, p=36\times 4, p=36\times 1$ MPI processes. The constant factor $F$ for the size of the blocks has been chosen as $F=70$. The MPI datatype is MPI_INT.
Figure 2: Irregular allgather results, native versus new, with the OpenMPI 4.1.4 library with $p=36\times 32$ MPI processes and different types of input problems (regular, irregular, degenerate). The constant factor $G$ for the number of blocks has been chosen as $G=40$. The MPI datatype is MPI_INT.
Figure 3: Regular allgather results, native versus new, with the OpenMPI 4.1.4 library with $p=36\times 32, p=36\times 4, p=36\times 1$ MPI processes. The constant factor $G$ for the number of blocks has been chosen as $G=40$. The MPI datatype is MPI_INT.

Theorems & Definitions (15)

Theorem 1
proof
Lemma 1
proof
Proposition 1
proof
Lemma 2
proof
Proposition 2
proof
...and 5 more
