Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney

TL;DR

This work introduces Partial Decoupled ASGD (PD-ASGD), a training paradigm that decouples forward and backward passes and employs partial layer-wise updates to mitigate staleness in asynchronous SGD. By enabling separate forward/backward threads and concurrent layer updates, PD-ASGD achieves higher throughput and robustness to delays, with reported speedups up to $5.95\times$ over synchronous data-parallel training and up to $2.14\times$ over comparable ASGD methods while maintaining competitive accuracy on vision and language tasks. The authors provide a bias bound $||b|| \le G \frac{\alpha \eta \tau_{\max}}{1-\alpha \eta \tau_{\max}}$ and model convergence to a stationary distribution $p^{*}(\boldsymbol{\theta}) \propto \exp(\sum_k h_k(\boldsymbol{\theta}))$, showing the method remains well-behaved when stale noise is small relative to minibatch noise, and they analyze forward/backward misalignment to quantify gradient bias. Empirical results on CIFAR-10/100 and IMDb demonstrate improved hardware utilization and robustness to delays, with ablations highlighting the necessity of partial layer-wise updates for reducing staleness and preserving accuracy. Overall, PD-ASGD offers a promising direction for scalable, asynchronous training on distributed and heterogeneous hardware, balancing speed, accuracy, and convergence guarantees.

Abstract

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads than the usual 1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise (partial) model updates concurrently across multiple threads. This reduces parameter staleness and consequently improves robustness to delays. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays, and up to $2.14\times$ faster than comparable ASGD algorithms by achieving higher model FLOPs utilization. We mathematically describe the gradient bias introduced by our method, establish an upper bound, and prove convergence.
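The two ideas in the abstract, separate forward and backward threads and layer-wise updates, can be illustrated with a toy sketch. This is not the authors' implementation: the two-scalar "network", the names `forward_worker`, `backward_worker`, and `train`, the learning rate, and the queue size are all assumptions chosen for illustration. Python threads also do not run in parallel under the GIL, so the sketch shows only the control flow, not the speedup.

```python
import queue
import random
import threading

TARGET = 2.0                     # toy task: learn w1 * w2 == TARGET (assumption)
LR = 0.01                        # illustrative learning rate
params = [1.0, 1.0]              # one scalar weight per "layer"
locks = [threading.Lock(), threading.Lock()]   # one lock per layer
work = queue.Queue(maxsize=2)    # small queue bounds how stale a forward pass gets


def forward_worker(seed, n_samples):
    """Run forward passes only and hand the results to the backward thread."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = rng.uniform(0.5, 1.5)
        # Each layer is read under its own lock, so the two reads may come
        # from different update steps (the forward/backward mismatch).
        with locks[0]:
            w1 = params[0]
        with locks[1]:
            w2 = params[1]
        h = w1 * x               # layer 1 output
        y = w2 * h               # layer 2 output
        work.put((x, h, w2, y - TARGET * x))


def backward_worker():
    """Consume forward results and update one layer at a time."""
    while True:
        item = work.get()
        if item is None:         # sentinel: no more work
            break
        x, h, w2_used, err = item
        # Layer 2 is updated as soon as its gradient is known ...
        with locks[1]:
            params[1] -= LR * err * h
        # ... then layer 1, using the (possibly stale) w2 from the forward pass.
        with locks[0]:
            params[0] -= LR * err * w2_used * x


def train(n_forward=2, n_samples=1500):
    """Two forward threads feed one backward thread (a 2:1 ratio)."""
    back = threading.Thread(target=backward_worker)
    fwd = [threading.Thread(target=forward_worker, args=(s, n_samples))
           for s in range(n_forward)]
    back.start()
    for t in fwd:
        t.start()
    for t in fwd:
        t.join()
    work.put(None)
    back.join()
    return params[0] * params[1]


if __name__ == "__main__":
    print(train())
```

Calling `train()` returns the learned product `w1 * w2`, which should end up close to `TARGET` despite the stale reads. The per-layer locks are the design point: a forward thread can read layer 1 while the backward thread is still writing layer 2.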

Paper Structure (19 sections, 1 theorem, 32 equations, 4 figures, 4 tables)

This paper contains 19 sections, 1 theorem, 32 equations, 4 figures, 4 tables.

Key Result

Theorem 5.1

For the bias $b$ introduced by the mismatch of the gradient between any forward and backward pass as in Eq. eq:gradient-with-error, there exist constants $G>0$ and $\alpha>0$ such that $\|b\|$ is uniformly bounded by $\|b\| \le G \frac{\alpha \eta \tau_\text{max}}{1-\alpha \eta \tau_\text{max}}$, where $0 \leq \alpha \eta \tau_\text{max} < 1$.
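To get a feel for how the bound $G \frac{\alpha \eta \tau_\text{max}}{1-\alpha \eta \tau_\text{max}}$ from the TL;DR behaves, the short sketch below evaluates it numerically. The function name `bias_bound` and the sample values are assumptions for illustration only; the theorem states that $G$ and $\alpha$ exist but this summary does not give their values.

```python
def bias_bound(G, alpha, eta, tau_max):
    """Evaluate G * x / (1 - x) with x = alpha * eta * tau_max.

    G, alpha: positive constants from the theorem (values here are made up).
    eta: learning rate. tau_max: maximum delay.
    """
    x = alpha * eta * tau_max
    if not 0 <= x < 1:
        raise ValueError("the bound requires 0 <= alpha * eta * tau_max < 1")
    return G * x / (1 - x)


if __name__ == "__main__":
    # Illustrative values only: the bound grows as the maximum delay grows.
    for tau in (1, 10, 50, 90):
        print(tau, bias_bound(G=1.0, alpha=1.0, eta=0.01, tau_max=tau))
```

The bound is zero when there is no delay and grows without limit as $\alpha \eta \tau_\text{max}$ approaches 1, so a smaller learning rate tolerates a larger maximum delay.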

Figures (4)

  • Figure 2: ResNet18 training time on CIFAR100 (left) and CIFAR10 (right) using DDP and PD-ASGD in the presence of stragglers.
  • Figure 3: Learning curves of asynchronous SGD with layer-wise updates (PD-ASGD) and block updates (D-ASGD) on the CIFAR100 dataset. 3 independent runs are shown for each class.
  • Figure A1: Learning curves of asynchronous SGD with layer-wise updates (PD-ASGD) and block updates (D-ASGD) for ResNet18 (top plots) and ResNet50 (bottom plots) on the CIFAR10 dataset.
  • Figure A2: ResNet18 accuracy on CIFAR100 (left) and CIFAR10 (right) using DDP and PD-ASGD in the presence of stragglers.

Theorems & Definitions (2)

  • Theorem 5.1: Bound on gradient bias
  • proof