Robust Fully-Asynchronous Methods for Distributed Training over General Architecture

Zehan Zhu, Ye Tian, Yan Huang, Jinming Xu, Shibo He

TL;DR

The paper tackles the inefficiency of perfect synchronization in distributed training by introducing R-FAST, a robust fully-asynchronous gradient-tracking method that operates over general spanning-tree topologies sharing a common root. It combines asynchronous execution, dual spanning-tree communication, and a robust gradient-tracking scheme with buffering to mitigate data heterogeneity and packet losses, supported by a thorough augmented-system convergence analysis for both strongly convex and non-convex objectives. The authors prove linear convergence to a neighborhood for smooth strongly convex F and sublinear convergence to stationary points for non-convex F, using a two-time-scale approach to handle delays and root activations. Empirically, R-FAST delivers 1.5-2x faster convergence than synchronous baselines like Ring-AllReduce and D-PSGD, while outperforming asynchronous SOTA methods in the presence of stragglers, and it scales effectively with the number of nodes and flexible network topologies.

Abstract

Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across devices and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication architectures. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms existing asynchronous SOTA algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.
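
To illustrate why tracking an auxiliary gradient estimate can remove the effect of data heterogeneity, the following is a minimal sketch of the classic synchronous, loss-free gradient-tracking iteration. It is not R-FAST itself: there is no asynchrony, no buffering, and no spanning-tree pair. The local objectives f_i(x) = 0.5 (x - b_i)^2, the ring mixing matrix, the step size, and all function names are assumptions chosen purely for illustration.

```python
def gradient_tracking(b, W, alpha=0.1, iters=300):
    """Classic synchronous gradient tracking on scalar quadratics.

    Node i holds f_i(x) = 0.5 * (x - b_i)**2 (illustrative assumption),
    so the minimizer of the average objective is mean(b).
    W is a doubly stochastic mixing matrix given as a list of rows.
    """
    n = len(b)

    def grad(x):
        # Local gradients, each evaluated at that node's own iterate.
        return [x[i] - b[i] for i in range(n)]

    def mix(v):
        # One round of weighted averaging with neighbors.
        return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

    x = [0.0] * n
    g = grad(x)
    y = g[:]  # tracking variable, initialized to the local gradients
    for _ in range(iters):
        wx = mix(x)
        x_new = [wx[i] - alpha * y[i] for i in range(n)]
        g_new = grad(x_new)
        wy = mix(y)
        # y tracks the network-average gradient via the gradient difference.
        y = [wy[i] + g_new[i] - g[i] for i in range(n)]
        x, g = x_new, g_new
    return x


# Undirected 4-node ring with self-weight 0.5 (illustrative choice).
W_RING = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
B_LOCAL = [1.0, 2.0, 3.0, 10.0]  # heterogeneous local data
x_final = gradient_tracking(B_LOCAL, W_RING)
```

Although each node only sees its own b_i, every entry of x_final approaches the global minimizer mean(B_LOCAL) rather than a node-specific point, which is the heterogeneity-removal property the abstract attributes to gradient tracking.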

Paper Structure

This paper contains 26 sections, 14 theorems, 149 equations, 15 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Suppose Assumptions Ass_weight_matrix-Ass_asyn hold. Define $K_1 \triangleq \left( 2n-1 \right) \cdot T+n\cdot D$, $C_2 \triangleq \frac{2\sqrt{\left( D+2 \right) n}\left( 1+\bar{m}^{-K_1} \right)}{1-\bar{m}^{K_1}}$, $\eta \triangleq \bar{m}^{K_1}$ and $\rho \triangleq (1-\eta)^\frac{1}{K_1}$. We ha
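
As a rough numerical aid, the constants defined in Lemma 1 can be evaluated directly from their definitions. The sketch below assumes $0 < \bar{m} < 1$ so that the denominator of $C_2$ is positive; the precise meaning of $n$, $T$, $D$ and $\bar{m}$ is given by the paper's assumptions, which are not reproduced in this excerpt, and the function name and sample values are hypothetical.

```python
import math


def lemma1_constants(n, T, D, m_bar):
    """Evaluate K_1, C_2, eta and rho as defined in Lemma 1.

    Assumes 0 < m_bar < 1 (an assumption of this sketch, needed so that
    1 - m_bar**K_1 is positive).
    """
    if not 0.0 < m_bar < 1.0:
        raise ValueError("this sketch assumes 0 < m_bar < 1")
    K1 = (2 * n - 1) * T + n * D
    eta = m_bar ** K1
    C2 = 2.0 * math.sqrt((D + 2) * n) * (1.0 + m_bar ** (-K1)) / (1.0 - eta)
    rho = (1.0 - eta) ** (1.0 / K1)
    return K1, C2, eta, rho


# Sample values chosen only for illustration.
K1, C2, eta, rho = lemma1_constants(n=2, T=1, D=1, m_bar=0.5)
```

Because $K_1$ grows linearly in $n$, $T$ and $D$ while $\eta = \bar{m}^{K_1}$ shrinks geometrically, $\rho$ moves toward 1 as any of these parameters increases.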

Figures (15)

  • Figure 1: An illustration of constructing two communication graphs over a strongly connected topology $\mathcal{G}$ with 4 nodes. $\mathcal{G}\left( W \right)$ (resp., $\mathcal{G}\left( A \right)$) is a (resp., reversed) spanning tree with node $3$ being the common root.
  • Figure 2: An illustration of the storage and communications of variables within each node. The blue ellipse contains the private variables meant for internal computations, and the red rectangle contains the communicating variables to be sent to out-neighbors. Solid lines indicate actual communications in the network while dotted lines indicate internal computational dependencies.
  • Figure 3: The network topologies. (a): binary tree graph; (b): directed ring graph; (c): line graph.
  • Figure 4: Performance of R-FAST in training a logistic regression model in terms of training loss versus epoch over (a) five different topologies (each composed of 7 nodes) and (b) a binary tree topology with different numbers of nodes.
  • Figure 5: Performance comparison of R-FAST with D-PSGD, S-AB, AD-PSGD, OSGP and Ring-AllReduce in training ResNet-50 when there is no straggler.
  • ...and 10 more figures

Theorems & Definitions (38)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • proof
  • Remark 3
  • Lemma 4
  • proof
  • Proposition 1
  • ...and 28 more