On the Computation Rate of All-Reduce

Yufeng Zhou, Hua Sun

TL;DR

This paper provides a cut-set upper bound and a linear programming lower bound on the All-Reduce computation rate, where the lower bound is based on time (bandwidth) sharing over all schemes that first perform Reduce and then perform Broadcast. Specializing the two bounds gives the optimal computation rate for a class of communication networks and the best-known rate bounds for cyclic, complete, and hypercube networks.

Abstract

In the All-Reduce problem, each of the K nodes holds an input and wishes to compute the sum of all K inputs through a communication network where each pair of nodes is connected by a parallel link with arbitrary bandwidth. The computation rate of All-Reduce is defined as the number of sum instances that can be computed per network use. For the computation rate, we provide a cut-set upper bound and a linear programming lower bound based on time (bandwidth) sharing over all schemes that first perform Reduce (aggregating all inputs at one node) and then perform Broadcast (sending the sum from that node to all other nodes). Specializing the two general bounds gives the optimal computation rate for a class of communication networks and the best-known rate bounds (where the upper bound is no more than twice the lower bound) for cyclic, complete, and hypercube networks.
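The Reduce-then-Broadcast structure described in the abstract can be illustrated with a minimal sketch. It assumes a single sum instance, a spanning tree given as a hypothetical `parent` map, and ignores bandwidths and timing entirely; the function name and data layout are illustrative and do not come from the paper.

```python
from collections import defaultdict


def reduce_then_broadcast(inputs, parent, root):
    """Toy All-Reduce: Reduce toward `root`, then Broadcast from `root`.

    inputs: dict mapping node -> its input value
    parent: dict mapping node -> parent in a spanning tree (root maps to None)
    Returns a dict mapping node -> the sum it holds after both phases.
    Bandwidths and scheduling are deliberately not modeled (assumption).
    """
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    # Reduce phase: every node sends the sum of its subtree to its parent,
    # so the root ends up with the sum of all inputs.
    def subtree_sum(node):
        return inputs[node] + sum(subtree_sum(c) for c in children[node])

    total = subtree_sum(root)

    # Broadcast phase: the root sends the total down every tree edge.
    result = {}
    stack = [root]
    while stack:
        node = stack.pop()
        result[node] = total
        stack.extend(children[node])
    return result


if __name__ == "__main__":
    # Path 3 -> 2 -> 1 with node 1 as the root.
    print(reduce_then_broadcast({1: 5, 2: 7, 3: -2}, {1: None, 2: 1, 3: 2}, root=1))
```

The paper's lower bound time-shares over many such schemes with different roots and trees; this sketch shows only one of them.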

Paper Structure

This paper contains 11 sections, 8 theorems, 28 equations, 9 figures.

Key Result

Theorem 1

For network $\mathcal{N}(\vec{\beta})$, the All-Reduce computation rate satisfies

Figures (9)

  • Figure 1: All $3^2 = 9$ rooted MAC tree networks $\mathcal{N}_{T_M} (\vec{\beta}_{T_M}, r)$ with $K = 3$ nodes; the root node is colored red. For each network, the root node may compute $W_1+W_2+W_3$ by using the network $2$ times.
  • Figure 2: All $3^2 = 9$ rooted BC tree networks $\mathcal{N}_{T_B} (\vec{\beta}_{T_B}, r)$ with $K = 3$ nodes; the root node is colored red. For each network, the root node may propagate $W_1+W_2+W_3$ to all other nodes by using the network $2$ times.
  • Figure 3: All $9$ rooted MAC-BC tree networks $\mathcal{N}_{T_{MB}} (\vec{\beta}_{T_{MB}}, r=1)$ with $K = 3$ nodes and root $r=1$ (colored red). Reduce edges are in black and Broadcast edges are in blue. Each network may perform $L$ instances of sum computation in $N=L+3$ network uses, achieving rate $R = L/N \rightarrow 1$ as $L \rightarrow \infty$.
  • Figure 4: (b) is a $1$-MAC-BC network $\mathcal{N}_{MB}^1(\vec{\beta}_{MB}^1, 1 \rightarrow 2)$ with its associated rooted MAC-BC network in (d). (c) is a $1$-MAC-BC network $\mathcal{N}_{MB}^2(\vec{\beta}_{MB}^2, 1 \rightarrow 2)$ with its associated rooted MAC-BC network in (e). The cut-edge $1 \rightarrow 2$ is colored red. (a) shows a network $\mathcal{N}(\vec{\beta})$ that is a linear combination of $\mathcal{N}_{MB}^1(\vec{\beta}_{MB}^1, 1 \rightarrow 2)$ and $\mathcal{N}_{MB}^2(\vec{\beta}_{MB}^2, 1 \rightarrow 2)$, as considered in the theorem on $1$-MAC-BC networks, where $\vec{\beta} = 2 \vec{\beta}_{MB}^1 + \vec{\beta}_{MB}^2$.
  • Figure 5: (a) A network $\mathcal{N}(\vec{\beta})$ whose topology is a bi-directed tree and (b) its associated rooted MAC-BC network. The capacity of a bi-directed tree is equal to the minimum bandwidth of all links.
  • ...and 4 more figures
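
Two numerical statements in the captions above can be checked with a small sketch: the pipelined rate $R = L/(L+3)$ from Figure 3, and the bi-directed tree capacity from Figure 5. The function names and the dictionary of directed-link bandwidths are hypothetical, and the default delay of 3 network uses is taken from the $K = 3$ example only.

```python
from fractions import Fraction


def pipelined_rate(num_instances, extra_uses=3):
    """Rate R = L / N with N = L + extra_uses network uses.

    extra_uses=3 matches the K = 3 example in the Figure 3 caption;
    other networks may need a different value (assumption).
    """
    return Fraction(num_instances, num_instances + extra_uses)


def bidirected_tree_capacity(link_bandwidths):
    """Capacity of a bi-directed tree as stated in the Figure 5 caption:
    the minimum bandwidth over all directed links.

    link_bandwidths: dict mapping (sender, receiver) -> bandwidth
    """
    return min(link_bandwidths.values())


if __name__ == "__main__":
    # The rate approaches 1 as the number of instances L grows.
    for L in (3, 97, 997):
        print(L, pipelined_rate(L), float(pipelined_rate(L)))

    # Hypothetical bandwidths on the path 1 <-> 2 <-> 3.
    beta = {(1, 2): 3, (2, 1): 2, (2, 3): 5, (3, 2): 4}
    print(bidirected_tree_capacity(beta))
```

Exact fractions are used so the convergence of $L/N$ toward $1$ is visible without floating-point noise.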

Theorems & Definitions (13)

  • Theorem 1
  • Definition 1: Rooted MAC Tree Network
  • Definition 2: Rooted BC Tree Network
  • Definition 3: Rooted MAC-BC Network
  • Theorem 2
  • Corollary 2.1
  • Definition 4: 1-MAC-BC Network
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • ...and 3 more