Revisiting the Time Cost Model of AllReduce

Dian Xiong; Li Chen; Youhe Jiang; Dan Li; Shuai Wang; Songtao Wang

TL;DR

This work updates the standard AllReduce time cost model by introducing GenModel, which adds a memory access term (δ) and an incast term (ε) to the traditional (α, β, γ) model, giving the end-to-end cost T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·Bε. It then uses GenModel to develop GenTree, a heuristic plan generator for tree-like topologies, and proves that optimal plan generation on arbitrary topologies is NP-hard. In real-world and simulated evaluations, GenModel predicts AllReduce time with much lower error than the (α, β, γ) model (≈2.6% vs. up to ≈19.8%), and GenTree is faster than NCCL and other state-of-the-art approaches across network configurations and scales (up to 7.4x in simulations, 1.65x on GPUs, and 2.4x on CPUs). The practical result is a more reliable framework for designing and selecting AllReduce algorithms for modern clusters, with open-source tools for fitting GenModel and running simulations to reproduce the results.
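The cost expression above can be evaluated directly. The following is a minimal sketch, not the authors' tooling: the names `GenModelParams`, `genmodel_cost`, and `abg_cost` are hypothetical, the parameter values are made up for illustration, and reading `w` as the fan-in of a step and `w_t` as the threshold below which incast adds no cost is an assumption based on the shape of the `max(w - w_t, 0)` term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenModelParams:
    """Per-cluster constants (hypothetical container; values must be fitted)."""
    alpha: float    # cost of launching one transmission
    beta: float     # cost per unit of data transmitted
    gamma: float    # cost per unit of data aggregated
    delta: float    # cost per unit of memory access
    epsilon: float  # incast cost coefficient
    w_t: int        # assumed meaning: fan-in threshold for incast


def abg_cost(p, A, B, C):
    """Classic cost: T = A*alpha + B*beta + C*gamma."""
    return A * p.alpha + B * p.beta + C * p.gamma


def genmodel_cost(p, A, B, C, D, w):
    """T = A*alpha + B*beta + C*gamma + D*delta + max(w - w_t, 0)*B*epsilon."""
    incast = max(w - p.w_t, 0) * B * p.epsilon
    return abg_cost(p, A, B, C) + D * p.delta + incast


# Illustrative numbers only, chosen so the arithmetic is easy to follow.
params = GenModelParams(alpha=1.0, beta=2.0, gamma=3.0,
                        delta=4.0, epsilon=0.5, w_t=8)
below_threshold = genmodel_cost(params, A=1, B=1, C=1, D=1, w=4)
above_threshold = genmodel_cost(params, A=1, B=1, C=1, D=1, w=10)
```

With these made-up values, the incast term is zero for `w=4` and contributes `(10 - 8) * 1 * 0.5` for `w=10`, so the two plans differ only in that term.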

Abstract

AllReduce is an important and popular collective communication primitive that is widely used in areas such as distributed machine learning and high-performance computing. The time cost model plays a crucial role in designing, analyzing, and choosing among the various algorithms and implementations of AllReduce, and the predominant one is the $(\alpha,\beta,\gamma)$ model. In this paper, we revisit this model and show that it does not accurately characterize the time cost of AllReduce on modern clusters and must therefore be updated. We perform extensive measurements to identify two additional terms that contribute to the time cost: the incast term and the memory access term. We augment the $(\alpha,\beta,\gamma)$ model with these two terms and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms and prove that they cannot be achieved simultaneously. Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree achieves a 1.22$\times$ to 1.65$\times$ speed-up over NCCL. Large-scale simulations also confirm that GenTree improves on the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.

Paper Structure (23 sections, 3 theorems, 14 equations, 11 figures, 7 tables, 2 algorithms)

Introduction
Background
Types of AllReduce Plan
The $(\alpha,\beta,\gamma)$ Model
Motivation
GenModel: An Up-to-date AllReduce Time Cost Model
The Memory Access Term ($\delta$)
The Incast Term ($\varepsilon$)
GenModel and Its Implications
Incast Optimal
Memory Access Optimal
An Impossibility Result
Fitting GenModel to a New Cluster
GenTree: an AllReduce Plan Generation Algorithm
NP-Hardness of AllReduce Plan Generation
...and 8 more sections

Key Result

Theorem 1

The lower bound of the memory access cost is […]. An algorithm is memory access optimal if and only if its memory access cost equals this lower bound.

Figures (11)

Figure 1: AllReduce plan types
Figure 2: AllReduce in the view of the $(\alpha,\beta,\gamma)$ model: in each step, first launching the transmission, then transmitting the data, finally aggregating the received data.
Figure 3: PFC pause frames and extra communication overhead of $x$-to-$1$ communications with $x$ ranging from $6$ to $15$.
Figure 4: Average reduce overhead between every two vectors (${T_x}/(x-1)$) while processing $x$ 150M-float-vectors. The more vectors that are reduced at once, the faster each reduce operation will be.
Figure 5: Example of $6\times4$ Hierarchical Co-located PS. In the first step, servers form 6-server groups, and do ReduceScatter inside the groups. In the second step, servers form 4-server groups, and do ReduceScatter on the results of the previous step inside the groups. These two groupings are orthogonal.
...and 6 more figures
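
The trend in Figure 4 is consistent with a simple count of memory accesses per vector element. The sketch below rests on an assumption not stated in this summary: a pairwise reduce reads two vectors and writes one result (3 accesses), while a fused reduce of x vectors reads each vector once and writes once (x + 1 accesses). The function names are hypothetical, and the counts are an illustration rather than the paper's measured costs.

```python
def pairwise_accesses(x):
    """Accesses per element when x vectors are reduced two at a time.

    Assumed model: each of the (x - 1) pairwise reduces does 2 reads + 1 write.
    """
    return 3 * (x - 1)


def fused_accesses(x):
    """Accesses per element when x vectors are reduced in a single pass.

    Assumed model: x reads (one per vector) + 1 write of the result.
    """
    return x + 1


def per_reduce(total_accesses, x):
    """Average accesses per reduce operation, mirroring T_x / (x - 1)."""
    return total_accesses / (x - 1)


# Per-reduce cost for a few fan-ins under the assumed counting model.
fused_per_reduce = {x: per_reduce(fused_accesses(x), x) for x in (2, 4, 8, 16)}
pairwise_per_reduce = {x: per_reduce(pairwise_accesses(x), x) for x in (2, 4, 8, 16)}
```

Under this counting model the pairwise cost per reduce stays constant at 3, while the fused cost per reduce, (x + 1)/(x - 1), falls toward 1 as x grows, which matches the direction of the caption's claim.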

Theorems & Definitions (5)

Definition 1: $\varepsilon$-optimal
Definition 2: $\delta$-optimal
Theorem 1
Theorem 2
Theorem 3
