Asynchronous Training Schemes in Distributed Learning with Time Delay

Haoxiang Wang; Zhanhong Jiang; Chao Liu; Soumik Sarkar; Dongxiang Jiang; Young M. Lee

TL;DR

The paper presents the convergence rate of the proposed algorithm (PC-ASGD) under delay with a constant step size, for smooth objective functions that are weakly strongly-convex or nonconvex, together with a condition that helps determine the tradeoff parameter.

Abstract

In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance. This issue is usually tackled by delay tolerant algorithms with some mild assumptions on the objective functions and step sizes. In this paper, we propose a different approach to develop a new algorithm, called $\textbf{P}$redicting $\textbf{C}$lipping $\textbf{A}$synchronous $\textbf{S}$tochastic $\textbf{G}$radient $\textbf{D}$escent (aka, PC-ASGD). Specifically, PC-ASGD has two steps - the $\textit{predicting step}$ leverages the gradient prediction using Taylor expansion to reduce the staleness of the outdated weights while the $\textit{clipping step}$ selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects between these two steps. Theoretically, we present the convergence rate considering the effects of delay of the proposed algorithm with constant step size when the smooth objective functions are weakly strongly-convex and nonconvex. One practical variant of PC-ASGD is also proposed by adopting a condition to help with the determination of the tradeoff parameter. For empirical validation, we demonstrate the performance of the algorithm with two deep neural network architectures on two benchmark datasets.
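The two steps described in the abstract can be illustrated with a small sketch. This is not the paper's exact update rule. The toy quadratic objective, the two-agent setting, the mixing weight `mix`, the convex blend through `theta`, and all function names are assumptions made for illustration. The predicting candidate uses a first-order Taylor expansion of the gradient around the stale weights, and the clipping candidate drops the stale neighbor entirely.

```python
# Illustrative sketch only. The toy objective f(w) = 0.5 * sum(a_i * w_i^2),
# the two-agent setup, and the blending rule are assumptions, not the
# paper's exact PC-ASGD update.

def grad(w, a):
    """Gradient of the toy objective f(w) = 0.5 * sum(a_i * w_i^2)."""
    return [ai * wi for ai, wi in zip(a, w)]


def predicted_grad(w_stale, w_now, a):
    """First-order Taylor prediction of the gradient at w_now from w_stale.

    g(w_now) ~= g(w_stale) + H(w_stale) * (w_now - w_stale).
    For the toy objective the Hessian is diag(a), so the prediction is exact.
    """
    g_stale = grad(w_stale, a)
    return [g + ai * (wn - ws)
            for g, ai, wn, ws in zip(g_stale, a, w_now, w_stale)]


def pc_step(w_local, w_stale, a, theta, lr, mix=0.5):
    """One blended update for a local agent with a single delayed neighbor.

    theta = 1 uses only the predicting candidate, theta = 0 only the
    clipping candidate (hypothetical blending rule).
    """
    g_local = grad(w_local, a)
    g_pred = predicted_grad(w_stale, w_local, a)

    # Predicting candidate: mix in the stale neighbor weights and add the
    # Taylor-predicted neighbor gradient to the local gradient.
    predicting = [(1 - mix) * wl + mix * ws - lr * (gl + mix * gp)
                  for wl, ws, gl, gp in zip(w_local, w_stale, g_local, g_pred)]

    # Clipping candidate: drop the stale neighbor, plain local gradient step.
    clipping = [wl - lr * gl for wl, gl in zip(w_local, g_local)]

    return [theta * p + (1 - theta) * c for p, c in zip(predicting, clipping)]


a = [1.0, 2.0]
w_stale = [1.0, 1.0]     # delayed neighbor weights
w_local = [0.5, 0.25]    # current local weights

print(predicted_grad(w_stale, w_local, a))          # matches grad(w_local, a)
print(pc_step(w_local, w_stale, a, theta=0.0, lr=0.5))
print(pc_step(w_local, w_stale, a, theta=1.0, lr=0.5))
```

On this quadratic the Taylor prediction recovers the true gradient exactly, which is why the sketch uses it. For a deep network the Hessian would have to be approximated, and the prediction would only reduce staleness rather than remove it.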

Paper Structure (17 sections, 3 theorems, 86 equations, 7 figures, 4 tables, 2 algorithms)

Introduction
Formulation and Preliminaries
PC-ASGD
Algorithm Design
Convergence Analysis
Experiments
Practical Variant
Distributed Network and Learning Setting
Performance Evaluation
Impacts of Different Delay Settings
Impacts of Network Size
Numerical Studies on theta Assignments
Time Cost Comparison
Verification Using Simple Functions
Conclusion
...and 2 more sections

Key Result

Lemma 1

(Consensus) Let Assumptions 2 and 3 hold. Assume that the delay compensated gradients are uniformly bounded, i.e., there exists a scalar $B>0$, such that Then for all $i\in V$ and $t\geq0$, $\exists \eta > 0$, we have where $\theta_m=\text{max}\{\theta_{s+1}\}^{t+\tau-1}_{s=t}$, $\delta_2=\text{max}\{\theta_se_2+(1-\theta_s)\tilde{e}_2\}^{t+\tau-1}_{s=0}<1$, where $e_{2}:=e_{2}(W) < 1$ and $\til

Figures (7)

Figure 1: Testing accuracy on CIFAR-10 and CIFAR-100.
Figure 2: Performance evaluation for different steps of delay.
Figure 3: Performance evaluation for different numbers of agents.
Figure 4: Predicting and clipping steps choices changing with epochs.
Figure 5: Average time costs for different methods (per epoch).
...and 2 more figures

Theorems & Definitions (14)

Definition 1
Definition 2
Lemma 1
Remark 1
Theorem 1
Theorem 2
Remark 2
proof
proof
proof
...and 4 more
