DeFT: Mitigating Data Dependencies for Flexible Communication Scheduling in Distributed Training

Lin Meng, Yuzhong Sun

TL;DR

DeFT addresses communication bottlenecks in data-parallel distributed training by mitigating hard data dependencies and enabling flexible scheduling. It reframes forward and backward communication scheduling as two 0/1 knapsack problems, and combines them with delayed updates, adaptive update frequencies, and heterogeneous multi-link communication to maximize overlap, while a Preserver module adjusts the schedule to protect convergence. Implemented in PyTorch with Profiler, Solver, and Preserver modules, DeFT demonstrates speedups of 29% to 115% over baselines on 16 A100 GPUs across ResNet-101, VGG-19, and GPT-2, with negligible accuracy loss. The work contributes a practical framework that reduces bubbles in the compute stream, balances bucket times, and adaptively manages communication resources for scalable data-parallel training.

Abstract

Communication scheduling aims to reduce communication bottlenecks in data parallel training (DP) by maximizing the overlap between computation and communication. However, existing schemes fall short due to three main issues: (1) hard data dependencies prevent some overlap between communication and computation; (2) high coverage rates limit further performance improvement; (3) imbalanced communication/computation times of tensors, caused by partitioning/fusion strategies, produce more bubbles. To address these drawbacks, we propose a new communication scheduling scheme, DeFT, whose key insight is to mitigate data dependencies and support flexible scheduling in distributed training. DeFT uncovers new overlapping opportunities in training by transforming the scheduling problem into multiple knapsack problems. Specifically, DeFT eliminates hard dependencies with delayed updates, reduces the coverage rate by adjusting the update frequency and utilizing heterogeneous communication links, and merges the computation times of the backward or forward pass into the knapsack capacity to avoid the negative impact of unbalanced tensors. Additionally, DeFT preserves training accuracy by adjusting its scheduling strategy via convergence loss quantification. Extensive experiments with 16 A100 GPUs showed that DeFT achieved speedups of 29% to 115% on three representative benchmarks compared to US-Byte and ByteScheduler with no loss of accuracy.
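The knapsack framing described in the abstract can be illustrated with a minimal sketch. It is not DeFT's Solver: it assumes integer time units, treats the merged computation time of one pass as the knapsack capacity, and uses each bucket's communication time as both its weight and its value, so that the selected buckets hide as much communication as possible behind computation. The function name `select_buckets` and its parameters are hypothetical.

```python
from typing import List, Tuple


def select_buckets(comm_times: List[int], capacity: int) -> Tuple[int, List[int]]:
    """0/1 knapsack over gradient buckets (illustrative assumption, not DeFT's code).

    comm_times: communication time of each bucket, in integer time units.
    capacity:   merged computation time of one forward or backward pass.
    Returns the total communication time hidden and the chosen bucket indices.
    """
    n = len(comm_times)
    # best[i][c] = max hidden communication time using the first i buckets
    # when c units of computation time are available.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        t = comm_times[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip bucket i-1
            if t <= c:  # or schedule it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - t] + t)

    # Walk the table backwards to recover which buckets were scheduled.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= comm_times[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen
```

For example, `select_buckets([4, 3, 5, 2], 9)` hides 9 time units of communication in a pass with 9 units of computation. Buckets that are not selected would have to be deferred, which is where delayed updates come in according to the abstract.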

Paper Structure

This paper contains 30 sections, 8 equations, 18 figures, 6 tables, 1 algorithm.

Figures (18)

  • Figure 1: Three problems that cannot be solved by current communication scheduling schemes. In (a), communications and computations with hard dependencies cannot run in parallel with their counterpart. In (b), communication bottlenecks leave a gap to optimal performance. In (c), imbalance between computation and communication causes wasted overlapping opportunities or bubbles.
  • Figure 2: An example of training a 7-layer network in WFBP with and without tensor fusion. The communication overhead with tensor fusion is lower because fewer startup delays are incurred.
  • Figure 3: Differences among three communication scheduling schemes. Priority schemes utilize forward computation to increase overlapping, while non-sequential schemes have a better tensor communication order and lower total communication overhead.
  • Figure 4: How the current task queue and the future task queue change. Yellow, green, and blue buckets represent unsynchronized gradient buckets from different iterations. In the fifth iteration, the green buckets in the future task queue are merged with the new buckets of the fifth iteration (not shown in the figure).
  • Figure 5: An example of concurrent heterogeneous communication. The communication of bucket #7 is scheduled onto the heterogeneous link during the backward pass of the last iteration.
  • ...and 13 more figures
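
The Figure 2 caption attributes the lower overhead of tensor fusion to fewer startup delays. A rough way to see this is the common latency-bandwidth cost model, in which each message costs a fixed startup time plus a per-size transfer time. The sketch below assumes that model; the function names, the integer units, and the example numbers are illustrative and are not taken from the paper.

```python
from typing import List


def unfused_comm_time(sizes_kb: List[int], startup_us: int, us_per_kb: int) -> int:
    """Each tensor is sent on its own, so each one pays the startup delay."""
    return sum(startup_us + size * us_per_kb for size in sizes_kb)


def fused_comm_time(sizes_kb: List[int], startup_us: int, us_per_kb: int) -> int:
    """All tensors are fused into one message, so the startup delay is paid once."""
    if not sizes_kb:
        return 0
    return startup_us + sum(sizes_kb) * us_per_kb


# Hypothetical gradient sizes (in KB) for a 7-layer network.
layer_sizes_kb = [10, 20, 30, 40, 50, 60, 70]
unfused = unfused_comm_time(layer_sizes_kb, startup_us=50, us_per_kb=2)
fused = fused_comm_time(layer_sizes_kb, startup_us=50, us_per_kb=2)
print(unfused, fused, unfused - fused)
```

With these made-up numbers the unfused transfers take 910 µs and the fused transfer takes 610 µs; the 300 µs difference is exactly the six startup delays that fusion avoids. The model ignores the fact that a fused message cannot start until all of its tensors are ready, which is one source of the imbalance the abstract describes.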