SparDL: Distributed Deep Learning Training with Efficient Sparse Communication

Minjun Zhao; Yichen Yin; Yuren Mao; Qing Liu; Lu Chen; Yunjun Gao

TL;DR

SparDL addresses the inefficiency of sparse gradient synchronization caused by the Sparse Gradient Accumulation (SGA) dilemma. It introduces three components: Spar-Reduce-Scatter, which performs Reduce-Scatter efficiently on sparse gradients; Global Residual Collection, which preserves discarded gradient information to ensure fast convergence; and Spar-All-Gather, which coordinates cross-team synchronization with an adjustable latency-bandwidth trade-off. By partitioning workers into $d$ teams and using non-recursive, block-wise sparsification together with Bruck All-Gather within teams, SparDL substantially reduces communication time while maintaining comparable model accuracy across diverse tasks and networks. Empirical results show speedups of up to 4.9x over state-of-the-art sparse All-Reduce methods on image classification and NLP tasks, on large-scale benchmarks (ImageNet with ResNet-50 and Wikipedia with BERT), and even on RDMA networks. These gains shorten training time and improve scalability, making SparDL practical for distributed sparse training in CV and NLP workloads.

Abstract

Top-k sparsification has recently been widely used to reduce the communication volume in distributed deep learning. However, due to the Sparse Gradient Accumulation (SGA) dilemma, the performance of top-k sparsification still has limitations. Recently, a few methods have been put forward to handle the SGA dilemma. Regrettably, even the state-of-the-art method suffers from several drawbacks, e.g., it relies on an inefficient communication algorithm and requires extra transmission steps. Motivated by the limitations of existing methods, we propose a novel efficient sparse communication framework, called SparDL. Specifically, SparDL uses the Spar-Reduce-Scatter algorithm, which is based on an efficient Reduce-Scatter model, to handle the SGA dilemma without additional communication operations. Besides, to further reduce the latency cost and improve the efficiency of SparDL, we propose the Spar-All-Gather algorithm. Moreover, we propose the global residual collection algorithm to ensure fast convergence of model training. Finally, extensive experiments are conducted to validate the superiority of SparDL.
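To make the SGA dilemma concrete, the following minimal sketch applies top-k sparsification on several simulated workers and sums the sparse results. It is not SparDL's algorithm. The function names (`top_k_sparsify`, `sparse_sum`, `demo`) and the parameter values are assumptions chosen for illustration. It shows only that the summed gradient can hold more than `k` non-zero entries, because the result's index set is the union of the workers' index sets.

```python
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a dense gradient.

    Returns (sparse, residual): `sparse` maps index -> value for the kept
    entries, and `residual` is a dense list holding the discarded values.
    """
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    sparse = {i: grad[i] for i in keep}
    residual = [0.0 if i in sparse else g for i, g in enumerate(grad)]
    return sparse, residual


def sparse_sum(a, b):
    """Add two sparse gradients; the result's index set is the union."""
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out


def demo(num_workers=4, dim=1000, k=10, seed=0):
    """Sum the top-k gradients of several simulated workers.

    Returns the number of non-zero positions in the accumulated gradient,
    which lies between k and num_workers * k.
    """
    rng = random.Random(seed)
    total = {}
    for _ in range(num_workers):
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sparse, _ = top_k_sparsify(grad, k)
        total = sparse_sum(total, sparse)
    return len(total)


if __name__ == "__main__":
    print("non-zeros after accumulation:", demo())
```

Each worker sends only `k` values, but the accumulated result may approach `num_workers * k` entries unless it is sparsified again. The `residual` list corresponds to the discarded information that a residual-collection scheme would carry forward to later iterations.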

Paper Structure (24 sections, 1 theorem, 10 equations, 18 figures, 2 tables, 2 algorithms)

Introduction
Background
State-of-the-art Methods
Our Solutions
Preliminaries
The Proposed SparDL
Overview
Spar-Reduce-Scatter Algorithm
Global Residual Collection Algorithm
Spar-All-Gather Algorithm
Experiments
Experimental Setup
Performance in Four Deep Learning Cases
Convergence in Four Deep Learning Cases
Comparison on large datasets: ImageNet and Wikipedia with ResNet-50 and BERT
...and 9 more sections

Key Result

Theorem 1

At each transmission step $i$, the ranks of the blocks in the sending bag of the $w$-th worker are a subset of the ranks of the blocks held by the $(w+2^{l-i})$-th worker.

Figures (18)

Figure 1: Illustration of the sparse gradient accumulation (SGA) dilemma
Figure 2: Illustration of All-Reduce, Reduce-Scatter and All-Gather operation
Figure 3: Illustration of recursive doubling and Bruck All-Gather
Figure 4: An overview of SparDL framework
Figure 5: Spar-Reduce-Scatter Algorithm. The number in the block represents the position of the block. Each block contains part of gradients (dense or sparse) of its corresponding position.
...and 13 more figures

Theorems & Definitions (4)

Example 1
Example 2
Theorem 1
Proof
