SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
Minjun Zhao, Yichen Yin, Yuren Mao, Qing Liu, Lu Chen, Yunjun Gao
TL;DR
SparDL addresses the inefficiency of sparse gradient synchronization caused by the Sparse Gradient Accumulation (SGA) dilemma with three components: Spar-Reduce-Scatter, which performs Reduce-Scatter efficiently on sparse gradients; Global Residual Collection, which preserves discarded gradient information to ensure fast convergence; and Spar-All-Gather, which coordinates cross-team synchronization with an adjustable latency-bandwidth trade-off. By partitioning workers into $d$ teams and combining non-recursive, block-wise sparsification with Bruck All-Gather within teams, SparDL substantially reduces communication time while maintaining comparable model accuracy across diverse tasks and networks. Experiments show speedups of up to 4.9x over state-of-the-art sparse All-Reduce methods on image classification and NLP tasks, including large-scale benchmarks such as ImageNet with ResNet-50 and Wikipedia with BERT, and including runs on RDMA networks. These gains shorten training time and improve scalability, making SparDL a practical option for distributed sparse training in CV and NLP workloads.
Abstract
Top-k sparsification is widely used to reduce communication volume in distributed deep learning. However, its performance is still limited by the Sparse Gradient Accumulation (SGA) dilemma. A few methods have recently been proposed to handle the SGA dilemma, but even the state-of-the-art method suffers from several drawbacks; for example, it relies on an inefficient communication algorithm and requires extra transmission steps. Motivated by these limitations, we propose SparDL, a novel and efficient sparse communication framework. Specifically, SparDL uses the Spar-Reduce-Scatter algorithm, which is based on an efficient Reduce-Scatter model, to handle the SGA dilemma without additional communication operations. To further reduce latency cost and improve the efficiency of SparDL, we propose the Spar-All-Gather algorithm. We also propose the global residual collection algorithm to ensure fast convergence of model training. Finally, extensive experiments validate the superiority of SparDL.
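To make the SGA dilemma concrete, the following is a minimal sketch of plain top-k sparsification with local residuals, not of SparDL itself. It assumes the usual description of the dilemma: each worker keeps its own k largest-magnitude entries, the kept indices differ across workers, and so the summed gradient accumulates more than k non-zeros as it is aggregated. The function names (`top_k_sparsify`, `sum_sparse`, `demo`) and the Gaussian toy gradients are illustrative assumptions and do not come from the paper.

```python
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad.

    Returns (sparse, residual): sparse maps index -> kept value, and
    residual holds the discarded values (zeros at kept positions).
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = set(order[:k])
    sparse = {i: grad[i] for i in kept}
    residual = [0.0 if i in kept else g for i, g in enumerate(grad)]
    return sparse, residual


def sum_sparse(sparse_grads):
    """Sum sparse gradients given as index -> value dicts."""
    total = {}
    for sg in sparse_grads:
        for i, v in sg.items():
            total[i] = total.get(i, 0.0) + v
    return total


def demo(num_workers=8, dim=1000, k=10, seed=0):
    """Return the number of non-zeros after summing every worker's top-k gradient."""
    rng = random.Random(seed)
    sparse_grads = []
    for _ in range(num_workers):
        # Toy gradient: independent Gaussian noise per worker (an assumption).
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sparse, _residual = top_k_sparsify(grad, k)
        sparse_grads.append(sparse)
    return len(sum_sparse(sparse_grads))


if __name__ == "__main__":
    nnz = demo()
    print(f"each worker sends k=10 entries; the sum has {nnz} non-zeros")
```

In this toy setting the summed gradient can hold anywhere from k to k times the number of workers non-zeros, so a naive sparse All-Reduce either transmits growing messages or must re-sparsify along the way. That is the kind of overhead the abstract says SparDL avoids without additional communication operations.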
