Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices
Anirudh Rajiv Menon, Unnikrishnan Menon, Kailash Ahirwar
TL;DR
Ravnest addresses the challenge of training billion-parameter models that exceed a single machine's memory by introducing an asynchronous decentralized framework. It groups heterogeneous PCs into clusters that perform Zero-Bubble Asynchronous Model Parallelism internally, and it uses a Parallel Multi-Ring All-Reduce across clusters for global parameter averaging. Theoretical analysis yields a convergence rate of $O(1/\sqrt{K})$, with conditions for linear speedup under bounded staleness, and experiments show substantial memory reductions with competitive convergence relative to synchronous baselines. The approach aims to make large-scale model training more accessible by enabling efficient use of commodity hardware across the internet, with fault tolerance and scalable cluster formation. The results indicate that Ravnest can reduce hardware requirements and training costs while maintaining robust convergence behavior.
Abstract
Modern deep learning models, growing larger and more complex, have demonstrated exceptional generalization and accuracy due to training on huge datasets. This trend is expected to continue. However, the increasing size of these models poses challenges in training, as traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular heterogeneous PCs with limited resources connected across the internet to achieve favourable performance metrics. Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without requiring each node to host the entire model. These clusters engage in $\textit{Zero-Bubble Asynchronous Model Parallel}$ training, and a $\textit{Parallel Multi-Ring All-Reduce}$ method is employed to execute global parameter averaging across all clusters. We frame our asynchronous SGD loss function as a block-structured optimization problem with delayed updates and derive an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.
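
To make the idea of global parameter averaging more concrete, the following is a minimal single-process sketch of the standard single-ring all-reduce (a reduce-scatter phase followed by an all-gather phase), used here to average parameter vectors. It is not Ravnest's Parallel Multi-Ring All-Reduce, whose details are not given in this abstract. The function name `ring_all_reduce_average`, the use of NumPy, and the simulation of nodes as list entries are all assumptions made for illustration.

```python
import numpy as np


def ring_all_reduce_average(node_params):
    """Simulate a standard ring all-reduce that averages one vector per node.

    node_params: list of equal-length 1-D sequences, one per simulated node.
    Returns a list with the averaged vector as seen by each node.
    (Hypothetical helper for illustration; not part of Ravnest.)
    """
    n = len(node_params)
    # Each node splits its own copy of the vector into n chunks.
    chunks = [
        np.array_split(np.asarray(p, dtype=float).copy(), n)
        for p in node_params
    ]

    # Reduce-scatter: in each step, node i sends one chunk to node i+1,
    # which adds it to its own copy. Messages are snapshotted first so
    # that all sends within a step behave as if they were simultaneous.
    for step in range(n - 1):
        sends = [
            (i, (i - step) % n, chunks[i][(i - step) % n].copy())
            for i in range(n)
        ]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # Now node i holds the fully summed chunk (i + 1) % n.

    # All-gather: each fully summed chunk is passed around the ring,
    # overwriting the stale partial sums on the receiving node.
    for step in range(n - 1):
        sends = [
            (i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
            for i in range(n)
        ]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    # Every node now holds the full sum; divide by n to get the average.
    return [np.concatenate(ch) / n for ch in chunks]


if __name__ == "__main__":
    local = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
    for node_id, avg in enumerate(ring_all_reduce_average(local)):
        print(node_id, avg)
```

In this scheme each node only exchanges one chunk per step with its ring neighbor, so the data sent per node stays roughly constant as the number of participants grows. That property is the usual motivation for ring-based averaging over links with limited bandwidth.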
