Round-optimal $n$-Block Broadcast Schedules in Logarithmic Time
Jesper Larsson Träff
TL;DR
The paper addresses round-optimal $n$-block broadcast in fully connected, one-ported networks, showing that per-processor schedules can be computed in $O(\log p)$ time and $O(\log p)$ space while achieving the optimal $n-1+\lceil \log_2 p\rceil$ communication rounds. It uses an explicit circulant-graph communication pattern with $q=\lceil \log_2 p\rceil$ to realize both single-root broadcast and all-to-all variants, and gives $O(\log p)$-time algorithms that compute per-processor receive and send schedules without inter-processor communication. The receive schedule employs a greedy DFS over canonical skip sequences to produce $q$ distinct blocks across rounds, while the send schedule builds on a power-of-two construction and extends it to non-power-of-two $p$, ensuring correctness with only a constant number of violations. The algorithms are implemented, compared to previous algorithms, and evaluated on a small $36\times 32$ processor-core cluster, supporting practical MPI_Bcast and MPI_Allgatherv implementations.
Abstract
We give optimally fast $O(\log p)$ time (per processor) algorithms for computing round-optimal broadcast schedules for message-passing parallel computing systems. This affirmatively answers the questions posed in Träff (2022). The problem is to broadcast $n$ indivisible blocks of data from a given root processor to all other processors in a (subgraph of a) fully connected network of $p$ processors with fully bidirectional, one-ported communication capabilities. In this model, $n-1+\lceil\log_2 p\rceil$ communication rounds are required. Our new algorithms compute for each processor in the network receive and send schedules each of size $\lceil\log_2 p\rceil$ that determine uniquely in $O(1)$ time for each communication round the new block that the processor will receive, and the already received block it has to send. Schedule computations are done independently per processor without communication. The broadcast communication subgraph is the same, easily computable, directed, $\lceil\log_2 p\rceil$-regular circulant graph used in Träff (2022) and elsewhere. We show how the schedule computations can be done in optimal time and space of $O(\log p)$, improving significantly over previous results of $O(p\log^2 p)$ and $O(\log^3 p)$. The schedule computation and broadcast algorithms are simple to implement, but correctness and complexity are not obvious. All algorithms have been implemented, compared to previous algorithms, and briefly evaluated on a small $36\times 32$ processor-core cluster.
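To make the quantities in the abstract concrete, the following Python sketch computes $q=\lceil\log_2 p\rceil$, the round count $n-1+\lceil\log_2 p\rceil$, and the neighbors of a processor in a $q$-regular circulant graph. The abstract does not state how the circulant skips are chosen; the repeated-halving rule in `halving_skips` is an assumption made for illustration only, and all function names are hypothetical. The sketch does not compute the receive and send schedules themselves.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    return (p - 1).bit_length()


def optimal_rounds(n: int, p: int) -> int:
    """Round count n - 1 + ceil(log2 p) stated in the abstract (n >= 1 blocks)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n - 1 + ceil_log2(p)


def halving_skips(p: int) -> list[int]:
    """ASSUMPTION: skips obtained by repeatedly halving p, rounding up.

    Returns q = ceil(log2 p) distinct skips in increasing order.
    The source text only says the graph is 'easily computable'.
    """
    q = ceil_log2(p)
    skips = [0] * (q + 1)
    skips[q] = p
    for k in range(q - 1, -1, -1):
        skips[k] = (skips[k + 1] + 1) // 2  # ceil(skips[k+1] / 2)
    return skips[:q]


def neighbors(r: int, p: int) -> tuple[list[int], list[int]]:
    """Out- and in-neighbors of processor r in the directed circulant graph."""
    skips = halving_skips(p)
    to_procs = [(r + s) % p for s in skips]
    from_procs = [(r - s) % p for s in skips]
    return to_procs, from_procs


if __name__ == "__main__":
    p = 36 * 32  # cluster size mentioned in the abstract
    print(ceil_log2(p), optimal_rounds(10, p), halving_skips(p))
```

Under this assumed skip rule, every processor has exactly $q$ outgoing and $q$ incoming edges, and each processor can compute its own neighbors in $O(\log p)$ time without communication, which matches the per-processor setting described above.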
