On Biased Compression for Distributed Learning

Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan

TL;DR

Biased gradient compression is analyzed as a tool to mitigate communication bottlenecks in distributed learning. The authors study three classes of biased compressors, two of which are new, establish linear convergence in both the single-node and distributed settings, and show that distributed compressed SGD with error feedback attains an ergodic convergence rate that depends on the compression parameter $δ$. Through a theoretical study of synthetic and empirical gradient distributions, they explain why and by how much biased compressors outperform their unbiased counterparts, and they propose new biased compressors, including Top-$k$ with dithering. Across extensive experiments, including large-scale transformer training, the work shows substantial communication savings with minimal loss in performance, highlighting practical impact for distributed optimization systems.

Abstract

In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left( δL \exp \left[-\frac{μK}{δL}\right] + \frac{(C + δD)}{Kμ}\right)$, where $δ\ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $μ$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
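The error feedback mechanism mentioned in the abstract can be sketched as follows. This is a minimal single-node illustration under stated assumptions, not the paper's algorithm: the diagonal quadratic objective, the step size `lr`, the choice of Top-$k$ as the compressor, and the names `top_k` and `ef_gd` are all hypothetical choices made for the example. The idea is that whatever the compressor discards in one step is stored in an error vector and added back before compressing in the next step.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest (a biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_gd(grad, x0, lr, k, steps):
    """Gradient descent with Top-k compression and error feedback (illustrative sketch)."""
    x = x0.copy()
    e = np.zeros_like(x0)          # accumulated compression error
    for _ in range(steps):
        p = e + lr * grad(x)       # add back what was dropped earlier
        c = top_k(p, k)            # only c would be communicated
        x -= c                     # apply the compressed update
        e = p - c                  # remember what was dropped this time
    return x

# Assumed toy problem: f(x) = 0.5 * sum(a_i * x_i^2), so mu = 1 and L = 10.
rng = np.random.default_rng(0)
d = 50
a = np.linspace(1.0, 10.0, d)
x0 = rng.standard_normal(d)
x_final = ef_gd(lambda x: a * x, x0, lr=0.001, k=5, steps=20000)
print(np.linalg.norm(x0), np.linalg.norm(x_final))
```

Here only 5 of 50 coordinates are sent per step, and the step size is deliberately small to reflect the fact that heavier compression (larger $δ$) calls for more conservative steps.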

Paper Structure

This paper contains 51 sections, 17 theorems, 152 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Lemma 3

For any $x\in \mathbb{R}^d$, if ${{\rm E}}\left[ \left\| {\cal C}(x) \right\|_2^2 \right] \leq \beta \langle {{\rm E}}\left[{\cal C}(x)\right] , x \rangle$, then

Figures (9)

  • Figure 1: Comparison of the Top-$k$ and Rand-$k$ sparsifiers with respect to normalized variance and the average number of encoding bits used per coordinate. Each point/marker represents a single $d=10^4$ dimensional vector drawn from a Gaussian distribution and then compressed by the specified operator. Each curve was obtained by varying the free parameter $k\in\{1,2,\dots,d\}$. Plots for different $d$ look very similar. Notice that for random sparsification the normalized variance is perfectly linear in the number of bits per coordinate. Letting $b$ be the total number of bits needed to encode the compressed vector (say, in the binary32 system), the normalized variance produced by the random sparsifier is almost $1-\tfrac{b/d}{32}$. The greedy sparsifier, however, achieves exponentially lower variance, $\approx 0.86^{b/d}$, using the same number of bits.
  • Figure 2: Calculations of the Rand-5 and Top-5 energy "saving" for practical gradient distributions ((a),(b),(c): quadratic problem, (d): logistic regression). The results of Top-5 are 3--5$\times$ better.
  • Figure 3: Comparison of various compressors (with and without free design parameters) with respect to the parameter $\delta\ge1$ in $\log_{10}$-scale and the average number of encoding bits used per coordinate. Each point/marker represents a single $d=10^4$ dimensional vector $x$ drawn from a Gaussian distribution and then compressed by the specified operator. Each curve was obtained by varying the free parameters ($k\in\{1,2,\dots,d\}$ for sparsifiers, $s\ge1$ for ditherings) of the specified compressor. Parameter-free compressors, such as ternary quantization and natural compression, have fixed communication budgets, which explains the vertical arrangement of their points.
  • Figure 4: Training/Test loss and accuracy for VGG19 on CIFAR10 distributed among $4$ nodes for $4$ different compression operators.
  • Figure 5: Training loss and test accuracy for VGG11 on CIFAR10 distributed among $4$ nodes for $5$ different compression operators.
  • ...and 4 more figures
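
The kind of comparison described in the Figure 1 caption can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the authors' experiment code: it takes "normalized variance" to mean $\|{\cal C}(x)-x\|_2^2/\|x\|_2^2$ and uses the unscaled Rand-$k$ operator, a reading suggested by the caption's $1-\tfrac{b/d}{32}$ formula (which equals $1-k/d$ when each kept coordinate costs 32 bits). The helper names `top_k`, `rand_k`, and `normalized_variance` are hypothetical.

```python
import numpy as np

def top_k(v, k):
    """Greedy sparsifier: keep the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Random sparsifier (unscaled): keep k entries chosen uniformly at random."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx]
    return out

def normalized_variance(v, c):
    """Relative squared error of the compressed vector c with respect to v."""
    return np.sum((c - v) ** 2) / np.sum(v ** 2)

rng = np.random.default_rng(0)
d = 10_000
x = rng.standard_normal(d)   # Gaussian vector, as in the caption
k = d // 10                  # keep 10% of the coordinates

nv_top = normalized_variance(x, top_k(x, k))
# Average over repeated draws to estimate the expectation for Rand-k.
nv_rand = np.mean([normalized_variance(x, rand_k(x, k, rng)) for _ in range(200)])
print(f"Top-k: {nv_top:.3f}  Rand-k: {nv_rand:.3f}  1 - k/d: {1 - k / d:.3f}")
```

With the same budget of $k$ coordinates, the greedy choice discards much less energy than the random one, which is the qualitative gap the figure illustrates.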

Theorems & Definitions (25)

  • Definition 1
  • Definition 2
  • Lemma 3
  • Definition 4
  • Definition 5
  • Theorem 6: Equivalence between biased compressors
  • Theorem 7: From unbiased to biased with scaling
  • Theorem 17
  • Theorem 18
  • Theorem 19
  • ...and 15 more
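
The title of Theorem 7 suggests that an unbiased compressor becomes a biased one after scaling. The theorem's statement is not reproduced in this listing, so the following is only an illustrative calculation under assumed definitions: suppose an unbiased compressor satisfies ${\rm E}[{\cal C}(x)] = x$ and ${\rm E}\|{\cal C}(x)\|_2^2 \le \zeta \|x\|_2^2$ for some $\zeta \ge 1$ (the symbol $\zeta$ is introduced here for illustration), and suppose a biased compressor with parameter $\delta$ is one obeying ${\rm E}\|{\cal C}(x) - x\|_2^2 \le \left(1 - \tfrac{1}{\delta}\right)\|x\|_2^2$. Expanding the square and using both assumptions gives

$$
{\rm E}\left\| \tfrac{1}{\zeta}{\cal C}(x) - x \right\|_2^2
= \tfrac{1}{\zeta^2}\,{\rm E}\left\|{\cal C}(x)\right\|_2^2 - \tfrac{2}{\zeta}\left\langle {\rm E}\left[{\cal C}(x)\right], x \right\rangle + \|x\|_2^2
\le \left( \tfrac{1}{\zeta} - \tfrac{2}{\zeta} + 1 \right)\|x\|_2^2
= \left( 1 - \tfrac{1}{\zeta} \right)\|x\|_2^2 .
$$

Under these assumed definitions, the scaled operator $\tfrac{1}{\zeta}{\cal C}$ is therefore a contraction with $\delta = \zeta$.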