ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less
Qiao Tan, Feng Zhu, Jingjing Zhang
TL;DR
ABS introduces an adaptive bounded staleness strategy for distributed parameter-server (PS) based SGD that jointly tunes the number of workers the PS waits for and the freshness of gradients. By using a dynamic staleness threshold $\tau_{\max}^t$ and an increasing $K^t$, along with a restart rule for highly stale workers and local SGD to reduce communication, ABS achieves faster wall-clock convergence and fewer communication rounds on non-convex objectives. Theoretical guarantees show ergodic convergence with a rate of $\mathcal{O}(1/\sqrt{TKU})$, and empirical results on CIFAR-10 demonstrate superior performance over AdaSync, SA-AdaSync, and Local SGD. Overall, ABS offers a practical, tunable mechanism to balance speed and accuracy in large-scale distributed training with stragglers.
Abstract
Wall-clock convergence time and communication rounds are critical performance metrics for distributed learning in the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for per round for gradient aggregation is adaptively selected to strike a straggling-staleness balance. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
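The two enablers described in the abstract can be illustrated with a minimal sketch of a single PS round. This is not the authors' algorithm: the function name `abs_round`, the per-worker staleness counter, the rule "restart when staleness exceeds `tau_max`", and the simulated finish times are all assumptions made for illustration, and the adaptive schedules for the number of awaited workers and the threshold are omitted.

```python
from typing import List, Tuple


def abs_round(
    finish_times: List[float],
    staleness: List[int],
    k: int,
    tau_max: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Simulate one PS round (hypothetical sketch, not the paper's method).

    Returns (aggregated workers, restarted workers, updated staleness).
    """
    if not 1 <= k <= len(finish_times):
        raise ValueError("k must be between 1 and the number of workers")

    # The PS waits only for the k workers that finish first.
    order = sorted(range(len(finish_times)), key=lambda i: finish_times[i])
    aggregated = sorted(order[:k])

    new_staleness = list(staleness)
    for i in aggregated:
        # Contributors receive the updated model, so their staleness resets.
        new_staleness[i] = 0

    restarted = []
    for i in order[k:]:
        # Assumed rule: a straggler falls one more round behind.
        new_staleness[i] += 1
        if new_staleness[i] > tau_max:
            # Too stale: abandon the current computation and restart
            # from the latest model.
            restarted.append(i)
            new_staleness[i] = 0

    return aggregated, sorted(restarted), new_staleness


# Four workers, the PS waits for the two fastest, staleness bound of 2.
print(abs_round([1.0, 3.0, 2.0, 5.0], [0, 2, 0, 1], k=2, tau_max=2))
# prints ([0, 2], [1], [0, 0, 0, 2])
```

In this toy round, workers 0 and 2 are aggregated, worker 1 exceeds the bound and restarts, and worker 3 keeps computing on its older model. Raising `k` moves the behavior toward synchronous SGD, while lowering it moves toward fully asynchronous updates.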
