Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates
Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney
TL;DR
This work introduces Partial Decoupled ASGD (PD-ASGD), a training paradigm that decouples the forward and backward passes and applies partial, layer-wise updates to mitigate staleness in asynchronous SGD. By running separate forward and backward threads and updating layers concurrently, PD-ASGD achieves higher throughput and robustness to delays, with reported speedups of up to $5.95\times$ over synchronous data-parallel training in the presence of delays and up to $2.14\times$ over comparable ASGD methods, while maintaining competitive accuracy on vision and language tasks. The authors provide a bias bound $\|b\| \le G \frac{\alpha \eta \tau_{\max}}{1-\alpha \eta \tau_{\max}}$ and model convergence to a stationary distribution $p^{*}(\boldsymbol{\theta}) \propto \exp(\sum_k h_k(\boldsymbol{\theta}))$, showing that the method remains well-behaved when the noise from stale gradients is small relative to minibatch noise; they also analyze forward/backward misalignment to quantify the gradient bias. Empirical results on CIFAR-10/100 and IMDb show improved hardware utilization and robustness to delays, and ablations indicate that partial layer-wise updates are necessary for reducing staleness and preserving accuracy. Overall, PD-ASGD is a candidate approach for scalable asynchronous training on distributed and heterogeneous hardware.
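To get a feel for the bias bound quoted above, it can help to plug in a few values. The shorthand $x = \alpha \eta \tau_{\max}$ is introduced here only for illustration; this summary does not define $G$, $\alpha$, $\eta$, or $\tau_{\max}$, so the numbers below are arithmetic on the stated formula, not results from the paper. The bound is finite and non-negative only when $0 \le x < 1$:

$$
\|b\| \le G \, \frac{x}{1 - x}, \qquad
x = 0.1 \Rightarrow \|b\| \le \tfrac{0.1}{0.9}\, G \approx 0.11\, G, \qquad
x = 0.5 \Rightarrow \|b\| \le G, \qquad
x = 0.9 \Rightarrow \|b\| \le 9\, G .
$$

The bound grows without limit as $x \to 1$, so under this formula keeping the product $\alpha \eta \tau_{\max}$ well below 1 is what keeps the bias small relative to $G$.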
Abstract
The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these issues. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing for a higher ratio of forward to backward threads than the usual 1:1 ratio, leading to higher throughput. PD-ASGD also performs layer-wise (partial) model updates concurrently across multiple threads. This reduces parameter staleness and consequently improves robustness to delays. Our approach yields close to state-of-the-art results while running up to $5.95\times$ faster than synchronous data parallelism in the presence of delays, and up to $2.14\times$ faster than comparable ASGD algorithms by achieving higher model FLOPs utilization. We mathematically describe the gradient bias introduced by our method, establish an upper bound, and prove convergence.
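The two ideas in the abstract, separate forward and backward threads and layer-wise updates, can be illustrated with a toy sketch. This is not the authors' implementation: the two-scalar "network", the bounded queue, the per-layer locks, and all names (`train`, `forward_worker`, `backward_worker`, `TARGET_SLOPE`, `LEARNING_RATE`) are assumptions made for illustration, and Python threads here show the control flow only, not real speedups.

```python
import queue
import random
import threading

TARGET_SLOPE = 3.0    # toy regression target t = 3 * x (illustrative assumption)
LEARNING_RATE = 0.01  # illustrative step size


def train(num_forward=2, num_backward=1, samples_per_forward=2000):
    """Fit y = w[1] * (w[0] * x) using decoupled forward/backward threads."""
    weights = [1.0, 1.0]                                # one scalar per "layer"
    layer_locks = [threading.Lock(), threading.Lock()]  # one lock per layer
    # A bounded queue limits how stale a stored activation can become.
    pending = queue.Queue(maxsize=4)

    def forward_worker(seed):
        rng = random.Random(seed)
        for _ in range(samples_per_forward):
            x = rng.uniform(-1.0, 1.0)
            with layer_locks[0]:
                h = weights[0] * x
            with layer_locks[1]:
                y = weights[1] * h
            pending.put((x, h, y))  # hand activations to a backward thread

    def backward_worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work
                return
            x, h, y = item
            grad_y = y - TARGET_SLOPE * x  # dL/dy for L = 0.5 * (y - t)^2
            # Layer-wise update: the last layer is updated as soon as its
            # gradient is known, before the first layer's gradient is applied.
            with layer_locks[1]:
                # Uses the current weight, which may differ from the one
                # seen in the forward pass (forward/backward misalignment).
                grad_h = grad_y * weights[1]
                weights[1] -= LEARNING_RATE * grad_y * h
            with layer_locks[0]:
                weights[0] -= LEARNING_RATE * grad_h * x

    forward_threads = [
        threading.Thread(target=forward_worker, args=(seed,))
        for seed in range(num_forward)
    ]
    backward_threads = [
        threading.Thread(target=backward_worker) for _ in range(num_backward)
    ]
    for thread in forward_threads + backward_threads:
        thread.start()
    for thread in forward_threads:
        thread.join()
    for _ in backward_threads:
        pending.put(None)  # one sentinel per backward thread
    for thread in backward_threads:
        thread.join()
    return weights[0] * weights[1]  # learned slope of the composed model


if __name__ == "__main__":
    print("learned slope:", train())
```

With the default 2:1 ratio of forward to backward threads, the learned slope should approach the target of 3.0 under these settings; the exact value varies with thread scheduling. Because each layer has its own lock, a forward thread can already read the updated last layer while the first layer's update is still pending, which is the sense in which the update is partial.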
