Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana

TL;DR

The paper addresses the learning-rate scaling that error accumulation induces in gradient sparsification for distributed SGD. It casts sparsification as Bayesian MAP inference and derives RegTop-$k$ by combining a Top-$k$-style prior with a forward-model likelihood approximated through large deviations. The RegTop-$k$ mask weights accumulated gradients by a regularization factor $u_\mu(|1+\Delta|)$, which limits harmful learning-rate inflation and improves convergence at high compression. In distributed linear regression, RegTop-$k$ converges to the global optimum at sparsity factors well below $S=1$, where Top-$k$ stays at a fixed distance from it; in ResNet-18 training on CIFAR-10, it reaches higher accuracy than Top-$k$.

Abstract

Error accumulation is effective for gradient sparsification in distributed settings: initially unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps. First, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Second, using the prior distribution inherited from Top-$k$, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-$k$. We call this algorithm regularized Top-$k$ (RegTop-$k$). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through numerical experiments. In distributed linear regression, it is observed that while Top-$k$ remains at a fixed distance from the global optimum, RegTop-$k$ converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-$k$ in distributed training of ResNet-18 on CIFAR-10, where it noticeably outperforms Top-$k$.
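
The error-accumulation mechanism described in the abstract can be illustrated with a minimal sketch of plain Top-$k$ at a single worker. This is not the paper's RegTop-$k$; the function names (`top_k_mask`, `sparsify_with_error_accumulation`) and the toy gradient values are assumptions made for illustration only.

```python
def top_k_mask(values, k):
    """Return a 0/1 mask selecting the k entries of largest magnitude."""
    order = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)
    chosen = set(order[:k])
    return [1 if j in chosen else 0 for j in range(len(values))]


def sparsify_with_error_accumulation(gradient, error, k):
    """One Top-k step with error accumulation at a single worker.

    Returns the sparse vector to transmit and the updated error memory.
    """
    # Add the entries left unsent in earlier iterations to the fresh gradient.
    accumulated = [e + g for e, g in zip(error, gradient)]
    mask = top_k_mask(accumulated, k)
    # Selected entries are transmitted; the rest are kept as error.
    sparse = [m * a for m, a in zip(mask, accumulated)]
    new_error = [(1 - m) * a for m, a in zip(mask, accumulated)]
    return sparse, new_error


# Toy run (hypothetical numbers): the same gradient arrives three times.
error = [0.0, 0.0]
sent = []
for _ in range(3):
    sparse, error = sparsify_with_error_accumulation([1.0, 0.4], error, k=1)
    sent.append(sparse)
```

In this toy run the second entry is left out twice and is transmitted in the third iteration with roughly three times its per-iteration value (about 1.2 instead of 0.4). This is the sense in which accumulation behaves like a scaling of the learning rate for the selected entries.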

Paper Structure

This paper contains 28 sections, 2 theorems, 83 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

The posterior probability $P_{n[j]}^t$ is alternatively computed as […], where $\mathbbmss{F}_j^k \subset \mathbbmss{R}^J$ is the set of points in $\mathbbmss{R}^J$ whose $j$-th entries are among their largest $k$ entries and $q_n(\mathbf{a}^t)$ is […]
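
The set $\mathbbmss{F}_j^k$ in Proposition 1 can be made concrete with a small membership test. The sketch below assumes that "largest" is measured by magnitude and that ties count as membership; neither convention is confirmed by the text above, and the function name `in_feasible_set` is hypothetical.

```python
def in_feasible_set(a, j, k):
    """Check whether entry j (0-based) of a is among its k largest entries.

    Assumption: entries are compared by absolute value, and an entry that
    ties with the k-th largest one is treated as being inside the set.
    """
    strictly_larger = sum(1 for x in a if abs(x) > abs(a[j]))
    return strictly_larger < k


# Two-dimensional case with k = 1, matching the setting J = 2, k = 1.
point = [3.0, -1.0]
first_in = in_feasible_set(point, 0, 1)   # first entry dominates in magnitude
second_in = in_feasible_set(point, 1, 1)  # second entry does not
```

Under these assumptions, for $J=2$ and $k=1$ the two sets split the plane along the lines where the two entries have equal magnitude.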

Figures (7)

  • Figure 1: Example of large learning rate scaling in Top-$k$. Since the largest local entries cancel out after aggregation at the server, Top-$k$ makes no progress over many iterations.
  • Figure 2: The feasible regions $\mathbbmss{F}_{1}^k$ and $\mathbbmss{F}_{2}^k$ for $J=2$ and $k=1$.
  • Figure 3: Optimality gap vs number of iterations for various sparsity factors $S=0.4$ (top left), $S=0.5$ (top right), $S=0.6$ (bottom left), and $S=0.9$ (bottom right). RegTop-$k$ starts to converge to the global optimum as $S$ surpasses a specific threshold, whereas Top-$k$ keeps converging to a point in the vicinity of the global optimum.
  • Figure 4: Homogeneity (left) vs heterogeneity (right): with heterogeneity, Top-$k$ remains away from the global optimum, whereas RegTop-$k$ converges to it.
  • Figure 5: Optimality gap vs sparsity. Top-$k$ converges to the global optimum only at $S=1$, whereas RegTop-$k$ starts converging at $S=0.55$.
  • ...and 2 more figures
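
The cancellation effect described in the caption of Figure 1 can be reproduced with a toy two-worker simulation. This is a sketch with hypothetical numbers, not the setup used in the paper. It assumes plain Top-$k$ by magnitude with error accumulation, averaging at the server, and local gradients that stay fixed while the aggregate is zero (the model does not move, so deterministic gradients do not change). The name `count_stalled_iterations` is introduced here for illustration.

```python
def top_k_mask(values, k):
    """Return a 0/1 mask selecting the k entries of largest magnitude."""
    order = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)
    chosen = set(order[:k])
    return [1 if j in chosen else 0 for j in range(len(values))]


def count_stalled_iterations(local_gradients, k, max_iters=100):
    """Count leading iterations whose aggregated sparse gradient is all zero.

    Returns (number of stalled iterations, first non-zero aggregate).
    """
    num_workers = len(local_gradients)
    num_entries = len(local_gradients[0])
    errors = [[0.0] * num_entries for _ in range(num_workers)]
    aggregate = [0.0] * num_entries
    for t in range(max_iters):
        aggregate = [0.0] * num_entries
        for n, grad in enumerate(local_gradients):
            # Local accumulated gradient: error memory plus fresh gradient.
            acc = [e + g for e, g in zip(errors[n], grad)]
            mask = top_k_mask(acc, k)
            errors[n] = [(1 - m) * a for m, a in zip(mask, acc)]
            # Server averages the sparse vectors sent by the workers.
            for j in range(num_entries):
                aggregate[j] += mask[j] * acc[j] / num_workers
        if any(x != 0.0 for x in aggregate):
            return t, aggregate
    return max_iters, aggregate


# Two workers whose dominant entries are opposite and cancel at the server.
stalled, first_update = count_stalled_iterations([[10.0, 1.0], [-10.0, 1.0]], k=1)
```

With these numbers the server receives a zero aggregate for 10 iterations. The first non-zero update then carries a value of 11 in the second entry, although the true average gradient in that entry is 1, which illustrates the large learning-rate scaling the caption refers to.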

Theorems & Definitions (9)

  • Definition 1: Principal MAP Problem
  • Definition 2: Top-$k$ Prior Belief
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Remark 1
  • Remark 2
  • Remark 3