Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices

Anirudh Rajiv Menon, Unnikrishnan Menon, Kailash Ahirwar

TL;DR

Ravnest targets the training of billion-parameter models that do not fit in a single machine's memory by introducing an asynchronous decentralized framework. It groups heterogeneous PCs into clusters that perform Zero-Bubble Asynchronous Model Parallelism internally and uses a Parallel Multi-Ring All-Reduce across clusters for global parameter averaging. Theoretical analysis yields a convergence rate of $O(1/\sqrt{K})$, with conditions for linear speedup under bounded staleness, and experiments show substantial memory reductions with convergence competitive with synchronous baselines. The approach aims to democratize large-scale model training by making commodity hardware connected over the internet usable, with fault tolerance and scalable cluster formation. The results indicate that Ravnest can reduce hardware requirements and training costs while maintaining robust convergence behavior.
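
To make the stated rate concrete, here is a short worked reading of it. A bound that decays as $O(1/\sqrt{K})$ means that the convergence measure after $K$ iterations is at most $c/\sqrt{K}$ for some constant $c$. The symbols $c$ and $\epsilon$ below are illustrative assumptions and do not come from the paper. Halving the bound takes four times as many iterations:

$$\frac{c}{\sqrt{4K}} = \frac{1}{2}\cdot\frac{c}{\sqrt{K}}$$

Equivalently, reaching a target value $\epsilon$ requires $K \geq c^2/\epsilon^2$ iterations.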

Abstract

Modern deep learning models, growing larger and more complex, have demonstrated exceptional generalization and accuracy due to training on huge datasets. This trend is expected to continue. However, the increasing size of these models poses challenges in training, as traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular heterogeneous PCs with limited resources connected across the internet to achieve favourable performance metrics. Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without necessitating that each node hosts the entire model. These clusters engage in $\textit{Zero-Bubble Asynchronous Model Parallel}$ training, and a $\textit{Parallel Multi-Ring All-Reduce}$ method is employed to effectively execute global parameter averaging across all clusters. We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates and derived an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.
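
The global parameter averaging mentioned above builds on Ring All-Reduce. The following is a minimal single-process sketch of the standard Ring All-Reduce pattern (scatter-reduce followed by all-gather), used here to average parameters. It is not Ravnest's implementation: the function name `ring_all_reduce_average` is hypothetical, network transfers are simulated with list copies, and the parallel multi-ring arrangement across clusters is not modeled.

```python
from typing import List


def ring_all_reduce_average(node_params: List[List[float]]) -> List[List[float]]:
    """Simulate Ring All-Reduce averaging over n nodes in one process.

    node_params[i] is the flat parameter vector held by node i.
    Assumption for simplicity: the vector length is divisible by n.
    """
    n = len(node_params)
    size = len(node_params[0])
    if size % n != 0:
        raise ValueError("vector length must be divisible by the number of nodes")
    chunk = size // n

    # Each node keeps its own copy of the vector, split into n chunks.
    bufs = [[list(p[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for p in node_params]

    # Phase 1 (scatter-reduce): in step s, node i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, list(bufs[i][(i - s) % n])) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]
    # Now node i holds the complete sum for chunk (i + 1) mod n.

    # Phase 2 (all-gather): in step s, node i forwards the completed chunk
    # (i + 1 - s) mod n to its right neighbour, which overwrites its copy.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, list(bufs[i][(i + 1 - s) % n]))
                 for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][c] = data

    # Turn sums into averages and flatten the chunks back into one vector.
    return [[x / n for c in range(n) for x in bufs[i][c]] for i in range(n)]


# Three nodes, as in the ring illustrated in Figure 2.
result = ring_all_reduce_average([[1.0, 2.0, 3.0],
                                  [4.0, 5.0, 6.0],
                                  [7.0, 8.0, 9.0]])
print(result[0])  # every node ends with [4.0, 5.0, 6.0]
```

Each node sends and receives only one chunk per step, so the per-node traffic stays roughly constant as nodes are added, which is the usual reason for preferring a ring over a central parameter server.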

Paper Structure

This paper contains 22 sections, 4 theorems, 54 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Under Assumptions 1-6 and while $\eta$ satisfies the constraints set by $A_1 > 0$, $A_2 \leq 0$, $A_3 \leq 1$ and $\frac{2\eta LT^2}{CN_m}\leq 1$, the proposed algorithm satisfies:

Figures (6)

  • Figure 1: Illustration of how pipeline parallelism reduces idle-time bubbles (Huang et al., 2019, GPipe)
  • Figure 2: One cycle of Ring All-Reduce across 3 nodes
  • Figure 3: Zero-Bubble Asynchronous Model Parallelism within one cluster
  • Figure 4: One round of Parallel Multi-Ring All-Reduce, with 5 parallel rings, during global parameter averaging.
  • Figure 5: Comparison of validation accuracies over epochs
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 1
  • Corollary 1.1
  • Lemma 1
  • Lemma 2