ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less

Qiao Tan, Feng Zhu, Jingjing Zhang

TL;DR

ABS introduces an adaptive bounded staleness strategy for distributed SGD in the parameter-server (PS) setting that jointly tunes the number of workers the PS waits for and the freshness of gradients. By using a dynamic staleness threshold $\tau_{\max}^t$ and an increasing number of awaited workers $K^t$, along with a restart rule for highly stale workers and local SGD to reduce communication, ABS achieves faster wall-clock convergence and fewer communication rounds on non-convex objectives. Theoretical guarantees show ergodic convergence at a rate of $\mathcal{O}(1/\sqrt{TKU})$, and empirical results on CIFAR-10 show that ABS outperforms AdaSync, SA-AdaSync, and Local SGD. Overall, ABS offers a practical, tunable mechanism to balance speed and accuracy in large-scale distributed training with stragglers.

Abstract

Wall-clock convergence time and communication rounds are critical performance metrics in distributed learning under the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for per round for gradient aggregation is adaptively selected to strike a straggling-staleness balance. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
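
To make the two mechanisms in the abstract concrete, the following is a minimal toy sketch, not the paper's algorithm. It simulates a PS that waits for the `k` fastest workers each round and forces any worker whose staleness would exceed a bound to restart from the current model. Everything here is an assumption made for illustration: the function name `simulate_bounded_staleness`, the scalar quadratic objective, the exponential compute times, the fixed bound `tau_max` (the paper uses a dynamic threshold $\tau_{\max}^t$), and the step-wise increasing schedule for `k` (the paper's actual rule for $K^t$ is not given in this summary).

```python
import random


def simulate_bounded_staleness(num_workers=8, rounds=200, k0=2, k_growth_every=50,
                               tau_max=4, local_steps=2, lr=0.05, seed=0):
    """Toy K-async parameter server with a staleness bound and a restart rule."""
    rng = random.Random(seed)
    target = 3.0   # minimiser of the toy objective f(w) = 0.5 * (w - target)^2
    w = 0.0        # global model held by the PS
    start_round = [0] * num_workers    # model version each worker started from
    start_model = [w] * num_workers    # copy of the model each worker started from
    finish_time = [rng.expovariate(1.0) for _ in range(num_workers)]
    restarts, max_staleness, clock = 0, 0, 0.0

    for t in range(rounds):
        # Hypothetical increasing schedule for the number of awaited workers.
        k = min(num_workers, k0 + t // k_growth_every)
        # The PS waits until the k fastest workers have finished.
        order = sorted(range(num_workers), key=lambda i: finish_time[i])
        arrived = set(order[:k])
        clock = finish_time[order[k - 1]]

        updates = []
        for i in arrived:
            max_staleness = max(max_staleness, t - start_round[i])
            local = start_model[i]
            for _ in range(local_steps):          # local SGD steps
                grad = (local - target) + rng.gauss(0.0, 0.1)
                local -= lr * grad
            updates.append(local - start_model[i])
        w += sum(updates) / k                     # aggregate the k model deltas

        for i in range(num_workers):
            too_stale = (t + 1) - start_round[i] > tau_max
            if i in arrived or too_stale:
                if i not in arrived:
                    restarts += 1                 # restart rule: drop stale work
                start_round[i] = t + 1
                start_model[i] = w
                finish_time[i] = clock + rng.expovariate(1.0)

    return {"w": w, "restarts": restarts,
            "max_staleness": max_staleness, "clock": clock}


if __name__ == "__main__":
    print(simulate_bounded_staleness())
```

In this sketch, raising `tau_max` trades fewer restarts for staler updates, while raising `k0` makes each round wait for more workers; these are the two knobs the abstract describes ABS as balancing.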

Paper Structure

This paper contains 14 sections, 4 theorems, 29 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

Let $g(\mathbf{w}^{t-\tau_k^t,u}_k,\xi^{t-\tau_k^t,u}_k)= \frac{1}{B}\sum_{b=1}^B\nabla f(\mathbf{w}_k^{t-\tau_k^t,u}; \xi_{k,b}^{t-\tau_k^t, u})$ be the batch gradient given $\mathbf{w}^{t-\tau_k^t,u}_k$ at iteration $t-\tau^t_k$ in local step $u$, and use the symbol $\mathbf{v}^{t,u}_k = g(\mathbf{

Figures (4)

  • Figure 1: The performance of $K$-async ($K=1$) with $N=8$ and different values of the gradient staleness threshold $\tau_{\max}$.
  • Figure 2: The performance of ABS compared to AdaSync, SA-AdaSync, and Local SGD with $N=10$. (a) Learning accuracy vs. time. (b) Learning accuracy vs. communication rounds.
  • Figure 3: The performance of ABS compared to AdaSync, SA-AdaSync, and Local SGD with $N=20$. (a) Learning accuracy vs. time. (b) Learning accuracy vs. communication rounds.
  • Figure 4: The performance of ABS with different values of the parameter $a$ when $K^0 = 5$ and $N = 20$. (a) Learning accuracy vs. time. (b) Learning accuracy vs. communication rounds.

Theorems & Definitions (12)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • proof
  • proof
  • ...and 2 more