Table of Contents
Fetching ...

Sparse Communication for Distributed Gradient Descent

Alham Fikri Aji, Kenneth Heafield

TL;DR

The paper tackles the communication bottleneck in data-parallel distributed SGD by sending sparse gradient updates instead of dense ones. It introduces Gradient Dropping, which preserves 99% of updates locally and uses residuals to maintain convergence, with thresholding guided by layer normalization. Across MNIST and neural machine translation, the approach yields substantial speedups (up to 49% on MNIST and 22% on NMT) without harming final accuracy or BLEU in most configurations, though aggressive quantization can hurt complex tasks. The work highlights the importance of task structure (e.g., embedding-dominated NMT) in compression effectiveness and suggests avenues for further reduction in multi-node systems.

Abstract

We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.

Sparse Communication for Distributed Gradient Descent

TL;DR

The paper tackles the communication bottleneck in data-parallel distributed SGD by sending sparse gradient updates instead of dense ones. It introduces Gradient Dropping, which preserves 99% of updates locally and uses residuals to maintain convergence, with thresholding guided by layer normalization. Across MNIST and neural machine translation, the approach yields substantial speedups (up to 49% on MNIST and 22% on NMT) without harming final accuracy or BLEU in most configurations, though aggressive quantization can hurt complex tasks. The work highlights the importance of task structure (e.g., embedding-dominated NMT) in compression effectiveness and suggests avenues for further reduction in multi-node systems.

Abstract

We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.

Paper Structure

This paper contains 10 sections, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Distributed SGD architecture with parameter sharding.
  • Figure 2: MNIST: Training loss and accuracy for different dropping ratios.
  • Figure 3: NMT: Training loss and validation BLEU for different dropping ratios.
  • Figure 4: MNIST: Comparison of local and global thresholds with and without layer normalization.
  • Figure 5: NMT: Comparison of local and global thresholds with and without layer normalization.
  • ...and 3 more figures