From promise to practice: realizing high-performance decentralized training

Zesen Wang, Jiaojiao Zhang, Xuyang Wu, Mikael Johansson

TL;DR

This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes.

Abstract

Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce. However, realizing this potential in multi-node training is challenging due to the complex design space that involves communication topologies, computation patterns, and optimization algorithms. This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes. Furthermore, to support the decentralized training of transformer-based models, we study a decentralized Adam algorithm that allows for overlapping communications and computations, prove its convergence, and propose an accumulation technique to mitigate the high variance caused by small local batch sizes. We deploy the proposed approach in clusters with up to 64 GPUs and demonstrate its practicality and advantages in both runtime and generalization performance under a fixed iteration budget.

Paper Structure

This paper contains 60 sections, 2 theorems, 126 equations, 15 figures, 5 tables, 3 algorithms.

Key Result

Theorem 4.1

Under Assumptions assump-Lips--assump-W, if $0<\beta_1<\beta_2<1$, for Algorithm alg:decentadam, we have [...], where $\bar{x}^{(t)}=\frac{1}{N}\sum_{i=1}^{N} x_i^{(t)}$, $\tau$ is defined in equation eq-tau, $\tilde{T}=T-\frac{\beta_1}{1-\beta_1}$, $F_{*}$ is the optimal value of equation eq-prob-denc, and $E=\frac{24 D R^2\sqrt{1-\beta_1}}{\sqrt{1-\beta_2} (1-\beta_1/\beta_2)^{3/2}}+\frac{2 \alpha DLR
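As a reading aid for the notation in Theorem 4.1, the following minimal Python sketch evaluates the two quantities the statement defines explicitly: the worker-averaged iterate $\bar{x}^{(t)}$ and the shifted horizon $\tilde{T}$. The function names `average_iterate` and `shifted_horizon`, the toy parameter vectors, and the chosen values of $T$, $\beta_1$, and $\beta_2$ are illustrative assumptions and do not come from the paper. The sketch does not reproduce the bound itself.

```python
import math


def average_iterate(local_iterates):
    """Coordinate-wise mean of the N workers' parameter vectors.

    Implements x_bar = (1/N) * sum_i x_i for vectors given as lists of floats.
    """
    n = len(local_iterates)
    dim = len(local_iterates[0])
    return [sum(x[k] for x in local_iterates) / n for k in range(dim)]


def shifted_horizon(T, beta1, beta2):
    """Return T_tilde = T - beta1 / (1 - beta1).

    The theorem requires 0 < beta1 < beta2 < 1; reject anything else.
    """
    if not 0.0 < beta1 < beta2 < 1.0:
        raise ValueError("Theorem 4.1 requires 0 < beta1 < beta2 < 1")
    return T - beta1 / (1.0 - beta1)


# Hypothetical toy data: N = 4 workers, each holding a 2-dimensional iterate.
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x_bar = average_iterate(workers)

# Hypothetical hyperparameters chosen only to satisfy the theorem's condition.
t_tilde = shifted_horizon(T=1000, beta1=0.5, beta2=0.999)

print(x_bar, t_tilde)
```

Because $\beta_1/(1-\beta_1)$ grows as $\beta_1$ approaches 1, $\tilde{T}$ is noticeably smaller than $T$ only when $T$ is small or the momentum parameter is very close to 1.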

Figures (15)

  • Figure 1: Timelines and dependency relations of decentralized training. $F$: forward pass, $B$: backward pass, $C$: decentralized communication/aggregation of model parameters with other workers, $U$: update model parameters. An arrow from task $X$ to $Y$ means that task $Y$ can start only after $X$ finishes.
  • Figure 2: Comparison of consensus errors by communication rounds with four 4-GPU nodes.
  • Figure 3: Distributions of normalized computation times of ResNet-50 training for an image classification task and of transformer training for neural machine translation. The variances of the normalized computation time for the image task and the translation task are 0.0017 and 0.0134, respectively.
  • Figure 4: Comparison of the predicted runtime by runtime model and the actual runtime for translation task based on transformer model. The lines are predicted runtime based on the runtime model with scaling to the real runtime in milliseconds. The scatters are the measured runtime.
  • Figure 5: Simulation by the runtime model. In the simulation, $N=8$ (8 workers), $b=4$ (4 buckets, from ResNet-50 experiments), $\theta=0.2$ (time taken by updating one bucket is 0.2 unit time), and the communication topology of decentralized training is complete topology.
  • ...and 10 more figures

Theorems & Definitions (10)

  • Theorem 4.1
  • Theorem B.1: Without Heavy-ball Momentum
  • proof (8 entries)