Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance

Alexander Stollenwerk, Laurent Jacques

TL;DR

The paper introduces FO-SGD, a distributed SGD algorithm that compresses gradients in both worker-to-server and server-to-worker directions by combining a dithering-based one-bit quantizer with a randomized universal sensing basis to flatten gradients before quantization. This flattening mitigates variance blow-up and enables robust convergence guarantees even for sparse stochastic gradients, while remaining computationally efficient via the fast Walsh-Hadamard transform. The authors provide convergence guarantees for convex objectives and establish non-convex convergence rates to stationary points, with error terms controlled by quantization and transform parameters. The approach achieves full bidirectional compression and is well-suited for scalable distributed optimization in large-scale learning settings.

Abstract

We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework. Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas: (i) a one-bit quantization procedure leveraging the technique of dithering, and (ii) a randomized fast Walsh-Hadamard transform to flatten the stochastic gradient before quantization. As a result, the approximation of the true gradient in this scheme is biased, but it prevents commonly encountered algorithmic problems, such as exploding variance in the one-bit compression regime, deterioration of performance in the case of sparse gradients, and restrictive assumptions on the distribution of the stochastic gradients. In fact, we show SGD-like convergence guarantees under mild conditions. The compression technique can be used in both directions of worker-server communication, therefore admitting distributed optimization with full communication compression.
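The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fwht`, `encode`, `decode`), the uniform dither on `[-lam, lam]`, the averaging over `K` independent dithers, and the random sign vector `signs` are all assumptions made for illustration, since the excerpt does not give the exact definitions.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[0]
    assert d > 0 and d & (d - 1) == 0, "length must be a power of two"
    h = 1
    while h < d:
        # Butterfly step: combine the two halves of every block of size 2h.
        x = x.reshape(-1, 2, h)
        x = np.stack((x[:, 0, :] + x[:, 1, :], x[:, 0, :] - x[:, 1, :]), axis=1)
        x = x.reshape(d)
        h *= 2
    return x / np.sqrt(d)

def encode(g, signs, lam, K, rng):
    """Flatten g, then take K dithered sign measurements (assumed scheme)."""
    z = fwht(signs * g)                               # z = H D_eps g
    tau = rng.uniform(-lam, lam, size=(K, z.size))    # uniform dither
    return np.where(z + tau >= 0, 1, -1).astype(np.int8)

def decode(bits, signs, lam):
    """Average the bits, rescale by lam, and undo the flattening."""
    z_hat = lam * bits.mean(axis=0)
    return signs * fwht(z_hat)                        # D_eps H z_hat

rng = np.random.default_rng(0)
d, K, lam = 64, 4000, 0.25
g = np.zeros(d)
g[3] = 1.0                                            # a maximally sparse "gradient"
signs = rng.choice([-1.0, 1.0], size=d)               # diagonal of D_eps
bits = encode(g, signs, lam, K, rng)
g_hat = decode(bits, signs, lam)
print(np.linalg.norm(g_hat - g))                      # small for large K
```

In this sketch the estimate is unbiased only while every flattened entry satisfies `|z_i| <= lam`; larger entries are clipped, which is one way a dithered one-bit scheme can become biased. A large `K` is used here only to make the averaging effect visible.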

Paper Structure

This paper contains 10 sections, 14 theorems, 131 equations, and 1 figure.

Key Result

Lemma 3.6

Let $\boldsymbol H_\varepsilon\in \mathbb{R}^{d\times d}$ be a randomized universal sensing basis. For any $\boldsymbol x\in \mathbb{R}^d$ and $\alpha \geq 2$, with probability at least $1-2\exp(-\tfrac{1}{4}\alpha^2\log d)$.
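The flattening effect of such a basis can be checked numerically. The sketch below does not reproduce the bound of Lemma 3.6. It assumes, for illustration only, that $\boldsymbol H_\varepsilon = \boldsymbol H \boldsymbol D_\varepsilon$, with $\boldsymbol H$ the normalized Walsh-Hadamard matrix and $\boldsymbol D_\varepsilon$ a diagonal matrix of random signs; the helper `hadamard` is a hypothetical name.

```python
import numpy as np

def hadamard(d):
    """Normalized Walsh-Hadamard matrix (Sylvester construction), d a power of two."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = np.array([[1.0]])
    base = np.array([[1.0, 1.0], [1.0, -1.0]])
    while H.shape[0] < d:
        H = np.kron(base, H)
    return H / np.sqrt(d)

rng = np.random.default_rng(0)
d = 256
eps = rng.choice([-1.0, 1.0], size=d)   # random signs (assumed construction)
H_eps = hadamard(d) * eps               # same as H @ diag(eps): column j scaled by eps[j]

x = np.zeros(d)
x[17] = 1.0                             # one-hot vector: all energy in one entry
y = H_eps @ x

print(np.max(np.abs(x)), np.max(np.abs(y)))   # largest entry before and after
print(np.linalg.norm(x), np.linalg.norm(y))   # Euclidean norm is preserved
```

For a one-hot input every entry of `y` has magnitude $1/\sqrt{d}$, so the largest entry shrinks from 1 to $1/\sqrt{d}$ while the Euclidean norm stays the same. This is the sense in which the transform spreads a sparse vector evenly before one-bit quantization.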

Figures (1)

  • Figure 1:

Theorems & Definitions (35)

  • Definition 3.1
  • Example : Linear regression via least-squares estimation
  • Definition 3.2: $K$-averaged dithered one-bit quantizer
  • Remark 3.3
  • Definition 3.4
  • Definition 3.5
  • Lemma 3.6
  • Definition 3.7: Encoder
  • Definition 3.8: Decoder
  • Definition 5.1
  • ...and 25 more