DeFT: Mitigating Data Dependencies for Flexible Communication Scheduling in Distributed Training
Lin Meng, Yuzhong Sun
TL;DR
DeFT addresses communication bottlenecks in data-parallel distributed training by mitigating hard data dependencies and enabling flexible scheduling. It reframes forward and backward communication scheduling as two 0/1 knapsack problems, supplemented by delayed updates, adaptive update frequencies, and heterogeneous multi-link communication, to maximize the overlap between computation and communication, while a Preserver module safeguards convergence. Implemented in PyTorch with Profiler, Solver, and Preserver modules, DeFT demonstrates 29–115% speedups over baselines on 16 A100 GPUs across ResNet-101, VGG-19, and GPT-2, with negligible accuracy loss. The work contributes a practical framework that reduces bubbles in the compute stream, balances bucket times, and adaptively manages communication resources for scalable data-parallel training.
Abstract
Communication scheduling aims to reduce communication bottlenecks in data-parallel training (DP) by maximizing the overlap between computation and communication. However, existing schemes fall short due to three main issues: (1) hard data dependencies prevent some communication from overlapping with computation; (2) high coverage rates limit further performance improvement; (3) imbalanced communication and computation times of tensors, caused by partitioning and fusion strategies, introduce more bubbles. To address these drawbacks, we propose a new communication scheduling scheme, DeFT, whose key insight is to mitigate data dependencies and support flexible scheduling in distributed training. DeFT uncovers new overlapping opportunities in training by transforming the scheduling problem into multiple knapsack problems. Specifically, DeFT eliminates hard dependencies with delayed updates, reduces the coverage rate by adjusting the update frequency and utilizing heterogeneous communication links, and merges the computation times of the backward or forward pass into the knapsack capacity to avoid the negative impact of unbalanced tensors. Additionally, DeFT preserves training accuracy by adjusting its scheduling strategy based on a quantification of convergence loss. Extensive experiments with 16 A100 GPUs showed that DeFT achieved speedups of 29% to 115% on three representative benchmarks compared to US-Byte and ByteScheduler, with no loss of accuracy.
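
The knapsack formulation described above can be illustrated with a minimal sketch. It is not DeFT's Solver. It assumes that each gradient bucket's communication time serves as both the item weight and the item value, and that the knapsack capacity is the merged computation time of one pass. The function name `select_buckets`, the integer time units, and the example numbers are all hypothetical.

```python
from typing import List, Tuple


def select_buckets(comm_times: List[int], capacity: int) -> Tuple[int, List[int]]:
    """Solve a 0/1 knapsack over communication buckets.

    Assumption (not from the paper): weight == value == communication time,
    so the objective is the total communication hidden behind computation.
    `capacity` stands in for the merged compute time of one pass.
    """
    n = len(comm_times)
    # best[i][c] = max overlapped time using the first i buckets within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        t = comm_times[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip bucket i-1
            if t <= c:  # option 2: schedule it if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - t] + t)

    # Walk the table backwards to recover which buckets were scheduled.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= comm_times[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical per-bucket communication times and compute window (same time unit).
comm_times = [4, 3, 5, 2]
capacity = 9
overlapped, scheduled = select_buckets(comm_times, capacity)
# Buckets that do not fit would be candidates for a delayed update in this sketch.
deferred = [i for i in range(len(comm_times)) if i not in scheduled]
print(overlapped, scheduled, deferred)
```

In this sketch, buckets left out of the knapsack are simply marked as deferred. How DeFT actually handles them (delayed updates, a second link, or a changed update frequency) is decided by its Solver and Preserver and is not modeled here.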
