Distributed Learning with Compressed Gradient Differences

Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, Peter Richtárik

TL;DR

The paper tackles the communication bottleneck in distributed training by introducing DIANA, a method that compresses gradient differences rather than gradients themselves and augments them with node memories to learn the true gradients. It provides rigorous convergence analyses for both strongly convex and nonconvex settings, including non-smooth regularizers and block quantization, and shows that learning the gradient at the optimum is possible via gradient-difference compression. The authors extend the theory to TernGrad and QSGD, derive optimal quantization strategies, and demonstrate practical benefits through extensive experiments, including multiple datasets and MPI/GPU implementations. Collectively, the work delivers both theoretical guarantees and actionable guidance for deploying compressed-gradient distributed optimization at scale.

Abstract

Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new distributed learning method -- DIANA -- which resolves this issue via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are superior to existing rates. We also provide theory to support non-smooth regularizers and study the difference between quantization schemes. Our analysis of block-quantization and differences between $\ell_2$ and $\ell_{\infty}$ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.
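The idea of compressing gradient differences can be illustrated with a small sketch. It is not the authors' implementation: it assumes each worker keeps a memory vector `h_i`, sends only a quantized version of `gradient - h_i`, and moves `h_i` toward the gradient with a step `alpha`. The toy objective (quadratics with targets `b_i`), the function names, and the values of `alpha`, `gamma` and `steps` are all hypothetical choices made for illustration, and momentum and the proximal step are omitted.

```python
import math
import random


def quantize(delta, rng):
    """Unbiased random sparsification using the l2 norm (a stand-in compressor)."""
    norm = math.sqrt(sum(v * v for v in delta))
    if norm == 0.0:
        return [0.0] * len(delta)
    # Keep coordinate j with probability |delta_j| / norm, scaled to +/- norm.
    return [math.copysign(norm, v) if rng.random() < abs(v) / norm else 0.0
            for v in delta]


def diana(targets, alpha=0.25, gamma=0.1, steps=2000, seed=0):
    """Minimise (1/n) * sum_i 0.5 * ||x - b_i||^2 using compressed gradient differences."""
    rng = random.Random(seed)
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    h = [[0.0] * d for _ in range(n)]   # worker memories h_i
    h_bar = [0.0] * d                   # server copy of the average memory
    for _ in range(steps):
        avg_q = [0.0] * d
        for i, b in enumerate(targets):
            grad = [x[j] - b[j] for j in range(d)]            # full local gradient
            q = quantize([grad[j] - h[i][j] for j in range(d)], rng)  # only q is sent
            h[i] = [h[i][j] + alpha * q[j] for j in range(d)]  # worker updates its memory
            avg_q = [avg_q[j] + q[j] / n for j in range(d)]
        g_hat = [h_bar[j] + avg_q[j] for j in range(d)]        # unbiased gradient estimate
        h_bar = [h_bar[j] + alpha * avg_q[j] for j in range(d)]
        x = [x[j] - gamma * g_hat[j] for j in range(d)]
    return x, h


if __name__ == "__main__":
    b = [[1.0, 0.0], [0.0, 2.0], [-3.0, 1.0], [4.0, 5.0]]
    x, h = diana(b)
    print("x:", x)        # the minimiser of this toy problem is the mean of the targets
    print("h_0:", h[0])   # compare with the local gradient at the optimum, x* - b_0
```

In this toy problem the local gradients at the optimum are nonzero, so quantizing the gradients directly (which is what the sketch does if `alpha=0`, since the memories then stay at zero) keeps injecting noise near the solution. With `alpha > 0` the memories track the local gradients, the differences being quantized shrink, and so does the quantization noise.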

Paper Structure

This paper contains 36 sections, 31 theorems, 60 equations, 14 figures, 6 tables, 2 algorithms.

Key Result

Theorem 3.3

Let $0\neq \Delta\in \mathbb{R}^{\tilde{d}}$ and ${\widetilde{\Delta}} \sim {\rm Quant}_{p}(\Delta)$ be its $p$-quantization. Then all expressions in \eqref{eq:expected_comm_cost} and \eqref{eq:expected_comm_cost2} are increasing functions of $p$.
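The dependence of sparsity on $p$ can be checked numerically with a short sketch. Definition 3.2 is not reproduced on this page, so the code assumes the form that is standard in this line of work: coordinate $j$ is kept with probability $|\Delta_j| / \|\Delta\|_p$ and, if kept, set to ${\rm sign}(\Delta_j)\|\Delta\|_p$. Under that assumption the expected number of nonzeros is $\|\Delta\|_1 / \|\Delta\|_p$. The function names and the example vector are hypothetical.

```python
import math
import random


def p_norm(delta, p):
    """l_p norm of a list of floats; p may be math.inf."""
    if p == math.inf:
        return max(abs(v) for v in delta)
    return sum(abs(v) ** p for v in delta) ** (1.0 / p)


def quantize_p(delta, p, rng):
    """Assumed form of p-quantization: keep coordinate j with prob. |delta_j| / ||delta||_p."""
    norm = p_norm(delta, p)
    if norm == 0.0:
        return [0.0] * len(delta)
    return [math.copysign(norm, v) if rng.random() < abs(v) / norm else 0.0
            for v in delta]


def expected_nonzeros(delta, p):
    """Expected sparsity under the assumed scheme: ||delta||_1 / ||delta||_p."""
    return p_norm(delta, 1) / p_norm(delta, p)


if __name__ == "__main__":
    rng = random.Random(0)
    delta = [3.0, -4.0, 0.0, 1.0]
    trials = 20000
    for p in (1, 2, math.inf):
        count = sum(sum(1 for v in quantize_p(delta, p, rng) if v != 0.0)
                    for _ in range(trials))
        print(p, count / trials, expected_nonzeros(delta, p))
```

Because $\|\Delta\|_p$ is non-increasing in $p$, the ratio $\|\Delta\|_1 / \|\Delta\|_p$ grows with $p$. In this sketch, larger $p$ therefore yields denser messages, which agrees with the monotonicity stated in the theorem.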

Figures (14)

  • Figure 1: Comparison of DIANA ($\beta = 0.95$) with QSGD, TernGrad and DQGD on the logistic regression problem for the "mushrooms" dataset.
  • Figure 2: Comparison of performance (images/second) for various numbers of GPUs/MPI processes: sparse-communication DIANA (2-bit) vs. Reduce with 32-bit floats (FP32).
  • Figure 3: Evolution of training (left) and testing (right) accuracy on CIFAR-10, using 4 algorithms: DIANA, SGD, QSGD and TernGrad. We have chosen the best runs over all tested hyper-parameters.
  • Figure 4: Typical communication cost using broadcast, reduce and gather for 64-bit and 32-bit floating point using 4 (solid) resp. 128 (dashed) MPI processes. See Section \ref{sec:A:MPI} for details about the network.
  • Figure M1: Illustration of the workings of DIANA, QSGD and TernGrad on the Rosenbrock function.
  • ...and 9 more figures

Theorems & Definitions (33)

  • Definition 3.2: $p$-quantization
  • Theorem 3.3: Expected sparsity
  • Lemma 4.3
  • Theorem 4.4
  • Corollary 4.5
  • Theorem 4.6
  • Corollary 4.7
  • Theorem 5.2
  • Corollary 5.3
  • Definition B.1: block-$p$-quantization
  • ...and 23 more