Novel Gradient Sparsification Algorithm via Bayesian Inference

Ali Bereyhi; Ben Liang; Gary Boudreau; Ali Afana

TL;DR

A novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) is proposed that controls the learning rate scaling of error accumulation. With ResNet-18 on CIFAR-10 at $0.1\%$ sparsification, it achieves about $8\%$ higher accuracy than standard Top-$k$.

Abstract

Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) that controls the learning rate scaling of error accumulation. The algorithm is developed by looking at the gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It utilizes past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.
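
To make the role of error accumulation concrete, the following is a minimal sketch of one worker-side step of standard Top-$k$ sparsification with error accumulation, the baseline the abstract refers to. It is not the RegTop-$k$ algorithm, whose posterior computation is not reproduced here. The function names `top_k_mask` and `top_k_step` are hypothetical, and the sketch assumes a flat gradient vector with $1 \le k \le J$.

```python
import numpy as np


def top_k_mask(vec, k):
    """Boolean mask selecting the k largest-magnitude entries of vec."""
    mask = np.zeros(vec.shape, dtype=bool)
    # argpartition places the indices of the k largest magnitudes last.
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    mask[idx] = True
    return mask


def top_k_step(grad, error, k):
    """One Top-k step with error accumulation (illustrative baseline only).

    Returns the sparsified vector to transmit and the updated error,
    i.e. the entries that were not transmitted in this step.
    """
    accumulated = grad + error          # add back what was dropped earlier
    mask = top_k_mask(accumulated, k)
    sparse = np.where(mask, accumulated, 0.0)
    new_error = accumulated - sparse    # dropped entries carry over
    return sparse, new_error


# The same local gradient is seen in two consecutive iterations.
grad = np.array([0.5, -3.0, 0.1, 2.0])
sparse1, err1 = top_k_step(grad, np.zeros_like(grad), k=1)
sparse2, err2 = top_k_step(grad, err1, k=1)
```

In this toy run, the first step transmits the entry $-3.0$ and the second transmits $4.0$, which is twice the corresponding gradient entry $2.0$. This is the implicit learning rate scaling that the abstract attributes to error accumulation.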

Paper Structure (22 sections, 2 theorems, 18 equations, 3 figures, 1 algorithm)

Introduction
Error Accumulation and Learning Rate Scaling
A Motivational Example
Related Work
Contributions
Notation
Preliminaries
Top-$k$ Sparsification
Bayesian Gradient Sparsification
Bayesian-Optimal Sparsification
Statistical Global Top-$k$
Top-$k$ in Bayesian Framework
RegTop-$k$ Sparsification
RegTop-$k$ Algorithm
Discussions on Algorithm 1
...and 7 more sections

Key Result

Proposition 1

The posterior $P_{n[j]}^t$ is computed as where $\mathbb{F}_j^k = \left\lbrace \boldsymbol{x} \in \mathbb{R}^J : x_j \in \mathop{\mathrm{argmax}^k}_i x_i \right\rbrace$ with $x_i$ denoting the $i$-th entry of $\boldsymbol{x}$, and $q_n(\mathbf{a}^t)$ is

Figures (3)

Figure 1: Example of large learning rate scaling in Top-$k$.
Figure 2: RegTop-$k$ versus Top-$k$ sparsification for three sparsity factors. Left: $S=0.4$; middle: $S=0.5$; right: $S=0.6$.
Figure 3: ResNet-18 on CIFAR-10 with $0.1\%$ sparsification.

Theorems & Definitions (6)

Definition 1: Principle MAP Problem
Definition 2: Top-$k$ Prior Belief
Proposition 1
proof
Proposition 2
proof
