Temporal Predictive Coding for Gradient Compression in Distributed Learning

Adrian Edin; Zheng Chen; Michel Kieffer; Mikael Johansson

TL;DR

This paper proposes a prediction-based gradient compression method for distributed learning with event-triggered communication. A linear predictor combines past gradients to predict the current gradient, with coefficients obtained by solving a least-squares problem.

Abstract

This paper proposes a prediction-based gradient compression method for distributed learning with event-triggered communication. Our goal is to reduce the amount of information transmitted from the distributed agents to the parameter server by exploiting temporal correlation in the local gradients. We use a linear predictor that \textit{combines past gradients to form a prediction of the current gradient}, with coefficients that are optimized by solving a least-squares problem. In each iteration, every agent transmits the predictor coefficients to the server so that the server can compute the predicted local gradient. The difference between the true local gradient and the predicted one, termed the \textit{prediction residual}, is transmitted \textit{only when its norm is above some threshold}. When this additional communication step is omitted, the server uses the prediction as the estimated gradient. The proposed design shows notable performance gains compared to existing methods in the literature, achieving convergence with reduced communication costs.
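The scheme described in the abstract can be illustrated with a minimal sketch of one communication round for a single agent. It is not the authors' implementation. Several details are assumptions made only for illustration: the agent and the server are assumed to hold the same buffer `past` of earlier gradient estimates, the least-squares problem is solved through the normal equations with a small hypothetical `ridge` term for numerical stability, and the residual is sent uncompressed (the quantization step shown in Figure 1 is omitted). All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve_linear(A: List[List[float]], b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ls_coefficients(past: List[Vector], grad: Vector, ridge: float = 1e-9) -> Vector:
    """Coefficients a minimizing ||grad - sum_i a[i] * past[i]||^2.

    Assumption: the ridge term only keeps the Gram matrix invertible.
    """
    p = len(past)
    gram = [[dot(past[i], past[j]) + (ridge if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [dot(past[i], grad) for i in range(p)]
    return solve_linear(gram, rhs)


def predict(past: List[Vector], coeffs: Vector) -> Vector:
    """Linear combination of the past gradients."""
    dim = len(past[0])
    return [sum(a * g[d] for a, g in zip(coeffs, past)) for d in range(dim)]


def agent_step(past: List[Vector], grad: Vector,
               threshold: float) -> Tuple[Vector, Optional[Vector]]:
    """Return the coefficients (always sent) and the residual (sent only if large)."""
    coeffs = ls_coefficients(past, grad)
    pred = predict(past, coeffs)
    residual = [g - p for g, p in zip(grad, pred)]
    norm = dot(residual, residual) ** 0.5
    return coeffs, (residual if norm > threshold else None)


def server_estimate(past: List[Vector], coeffs: Vector,
                    residual: Optional[Vector]) -> Vector:
    """Server-side gradient estimate: the prediction, plus the residual if received."""
    pred = predict(past, coeffs)
    if residual is None:
        return pred
    return [p + r for p, r in zip(pred, residual)]
```

With a large threshold the agent sends only the coefficients and the server falls back on the prediction; with a small threshold the residual is also sent and, in this uncompressed sketch, the server recovers the local gradient up to floating-point error.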

Paper Structure (13 sections, 3 theorems, 31 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
System Model
Gradient Compression with Predictive Coding
Linear Prediction using Least Square Estimator
Event-Triggered Residual Transmission
Comparison to State-of-the-Art Methods
Convergence Analysis of the Proposed Algorithm
First and Second Moment Limits
Convergence Analysis
Simulation Results
Conclusions
Proof of Lemma 4.1 (First Moment Bound)
Proof of Lemma 4.2 (Variance Bound)

Key Result

Lemma 4.1

Under the bounded-threshold assumption, we can bound the first moment of $\widetilde{\bm{g}}_{t}$ by

Figures (3)

Figure 1: Block diagram of the prediction-based gradient compression design for agent $k$. The "compression" block performs event-triggered transmission and classical compression. The prediction coefficients $\bm{a}_{k,t}$ are assumed to be transmitted with negligible distortion.
Figure 2: Training performance, using on average $R=6$ bits per residual element transmission.
Figure 3: Training performance, using on average $R=3$ bits per residual element transmission.

Theorems & Definitions (12)

Remark 3.1
Definition 4.1: Unbiased Random Compressor
Lemma 4.1: First Moment Bound
proof
Lemma 4.2: Variance Bound
proof
Remark 4.1
Remark 4.2
Theorem 4.3
proof
...and 2 more
