
Novel Gradient Sparsification Algorithm via Bayesian Inference

Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana

TL;DR

A novel sparsification algorithm called regularized Top-k (RegTop-k) is proposed that controls the learning rate scaling of error accumulation. With ResNet-18 on CIFAR-10 at 0.1% sparsification, it achieves about 8% higher accuracy than standard Top-k.

Abstract

Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) that controls the learning rate scaling of error accumulation. The algorithm is developed by looking at the gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It utilizes past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.
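To make the role of error accumulation concrete, here is a minimal NumPy sketch of standard Top-$k$ sparsification with error accumulation on a single worker. It is not RegTop-$k$, whose posterior-based mask is not specified in this summary. The function names (`top_k_mask`, `sparsify_with_error_accumulation`), the magnitude-based selection, and the toy gradient values are assumptions chosen for illustration.

```python
import numpy as np


def top_k_mask(vec, k):
    """Return a 0/1 mask selecting the k largest-magnitude entries of vec."""
    mask = np.zeros_like(vec)
    if k > 0:
        # argpartition places the k largest magnitudes in the last k slots.
        idx = np.argpartition(np.abs(vec), -k)[-k:]
        mask[idx] = 1.0
    return mask


def sparsify_with_error_accumulation(grad, error, k):
    """One worker step of standard Top-k with error accumulation (illustrative).

    grad  : current local gradient
    error : entries left unsent in earlier rounds
    Returns the sparse vector to transmit and the updated error.
    """
    accumulated = grad + error          # add back what was dropped before
    mask = top_k_mask(accumulated, k)   # keep only the k dominant entries
    sparse = mask * accumulated         # what the worker actually sends
    new_error = accumulated - sparse    # remainder carried to the next round
    return sparse, new_error


# Toy example (hypothetical values): the same gradient is seen twice.
grad = np.array([0.5, -0.1, 0.3, 0.05])
error = np.zeros_like(grad)

sparse1, error = sparsify_with_error_accumulation(grad, error, k=1)
sparse2, error = sparsify_with_error_accumulation(grad, error, k=1)

print(sparse1)  # only index 0 is sent in the first round
print(sparse2)  # index 2 is sent with twice its per-round gradient
```

In the second round, entry 2 is transmitted with its accumulated value, twice its per-round gradient. This is the implicit learning rate scaling that the abstract says error accumulation introduces.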

Paper Structure (22 sections, 2 theorems, 18 equations, 3 figures, 1 algorithm)


Key Result

Proposition 1

The posterior $P_{n[j]}^t$ is computed from an expression, not reproduced here, that involves the set $\mathbb{F}_j^k = \left\lbrace \boldsymbol{x} \in \mathbb{R}^J : x_j \in \mathop{\mathrm{argmax}^k}_i x_i \right\rbrace$, where $x_i$ denotes the $i$-th entry of $\boldsymbol{x}$, and a term $q_n(\mathbf{a}^t)$ whose definition is likewise omitted.
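The set $\mathbb{F}_j^k$ in Proposition 1 can be read as "all vectors whose $j$-th entry is among their $k$ largest entries". The sketch below tests membership under that reading. The helper name `in_top_k_set`, the 0-based index `j`, the inclusive handling of ties, and the comparison of raw values (rather than magnitudes) are assumptions, since the summary does not spell out these details.

```python
import numpy as np


def in_top_k_set(x, j, k):
    """Check whether x belongs to F_j^k, i.e. x[j] is among the k largest entries.

    Assumptions: j is 0-based, 1 <= k <= len(x), ties count as members,
    and entries are compared by raw value as written in the set definition.
    """
    x = np.asarray(x, dtype=float)
    kth_largest = np.sort(x)[-k]      # value of the k-th largest entry
    return bool(x[j] >= kth_largest)


# Toy vector (hypothetical values) with k = 2: the two largest entries are 1.5 and 0.9.
x = [0.2, 1.5, -0.7, 0.9]
print(in_top_k_set(x, j=1, k=2))  # entry 1.5
print(in_top_k_set(x, j=0, k=2))  # entry 0.2
```

If the paper ranks entries by magnitude instead, replacing `x` with `np.abs(x)` inside the function would give the corresponding test.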

Figures (3)

  • Figure 1: Example of large learning rate scaling in Top-$k$.
  • Figure 2: RegTop-$k$ versus Top-$k$ sparsification for three sparsity factors. Left: $S=0.4$; middle: $S=0.5$; right: $S=0.6$.
  • Figure 3: ResNet-18 on CIFAR-10 with $0.1\%$ sparsification.

Theorems & Definitions (6)

  • Definition 1: Principle MAP Problem
  • Definition 2: Top-$k$ Prior Belief
  • Proposition 1
  • proof
  • Proposition 2
  • proof