Boosting Asynchronous Decentralized Learning with Model Fragmentation

Sayan Biswas, Anne-Marie Kermarrec, Alexis Marouani, Rafael Pires, Rishi Sharma, Martijn de Vos

TL;DR

DivShare tackles communication stragglers in decentralized learning by splitting each local model into small fragments and sending every fragment to a random set of peers in parallel. The approach comes with a convergence guarantee that accounts for communication delays, and it improves both time-to-accuracy and final accuracy over state-of-the-art baselines on CIFAR-10 and MovieLens. Because fragments are small and spread at random, DivShare uses the collective bandwidth more efficiently and limits the impact of slow links, delivering up to 3.9x speedups and notable accuracy gains in both synthetic and real-world network conditions. The work shows that careful handling of asynchronous communication can make decentralized learning considerably more practical in geo-distributed deployments.

Abstract

Decentralized learning (DL) is an emerging technique that allows nodes on the web to collaboratively train machine learning models without sharing raw data. Dealing with stragglers, i.e., nodes with slower compute or communication than others, is a key challenge in DL. We present DivShare, a novel asynchronous DL algorithm that achieves fast model convergence in the presence of communication stragglers. DivShare achieves this by having nodes fragment their models into parameter subsets and send, in parallel to computation, each subset to a random sample of other nodes instead of sequentially exchanging full models. The transfer of smaller fragments allows more efficient usage of the collective bandwidth and enables nodes with slow network links to quickly contribute with at least some of their model parameters. By theoretically proving the convergence of DivShare, we provide, to the best of our knowledge, the first formal proof of convergence for a DL algorithm that accounts for the effects of asynchronous communication with delays. We experimentally evaluate DivShare against two state-of-the-art DL baselines, AD-PSGD and Swift, and with two standard datasets, CIFAR-10 and MovieLens. We find that DivShare with communication stragglers lowers time-to-accuracy by up to 3.9x compared to AD-PSGD on the CIFAR-10 dataset. Compared to baselines, DivShare also achieves up to 19.4% better accuracy and 9.5% lower test loss on the CIFAR-10 and MovieLens datasets, respectively.
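
The fragment-and-scatter step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: this summary does not state how fragments are formed, how many there are, or how many recipients each one gets. The sketch assumes a random, near-equal partition of parameter indices, and the names `fragment_indices`, `plan_sends`, and `recipients_per_fragment` are hypothetical.

```python
import random


def fragment_indices(num_params, num_fragments, rng):
    """Randomly partition parameter indices into disjoint fragments.

    Assumption: a uniform random partition; the paper may fragment differently.
    """
    indices = list(range(num_params))
    rng.shuffle(indices)
    # Strided slicing yields fragments whose sizes differ by at most one.
    return [sorted(indices[f::num_fragments]) for f in range(num_fragments)]


def plan_sends(node_id, num_nodes, fragments, recipients_per_fragment, rng):
    """Pair each fragment with distinct random recipients, excluding the sender."""
    peers = [p for p in range(num_nodes) if p != node_id]
    return [(frag, rng.sample(peers, recipients_per_fragment)) for frag in fragments]


rng = random.Random(0)                  # fixed seed so the run is repeatable
model = [0.1 * i for i in range(10)]    # toy "model" with 10 parameters
fragments = fragment_indices(len(model), 4, rng)
plan = plan_sends(0, 8, fragments, 2, rng)

for frag, recipients in plan:
    # Only the parameters in this fragment travel to these recipients.
    payload = {i: model[i] for i in frag}
    print(recipients, payload)
```

Each recipient receives only a subset of the parameters, so a node on a slow link can still deliver some fragments quickly instead of waiting for a full model transfer to finish.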

Paper Structure

This paper contains 23 sections, 2 theorems, 33 equations, 8 figures, 2 tables.

Key Result

Theorem 1

Under the assumptions on the objective, the noise variance, the population variance, and straggling, and if $f_i$ is $L$-smooth for all $i \in [n]$, then $\mathbb{E}\left[\frac{1}{\tilde{k}} \sum_{k<\tilde{k}} \left\| \nabla F\left(\overline{X^k}\right)\right\|^2\right]$ … where $\lambda_2 = \left\| \mathbb{E}\left[W\right]\Pi_F\right\|$ with $\Pi_F$ being the canonical projector on $F=\mathbf{1}^\perp$, …
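
To make the quantity $\lambda_2 = \left\| \mathbb{E}\left[W\right]\Pi_F\right\|$ concrete, the following NumPy sketch evaluates it for two toy expected mixing matrices. It rests on two assumptions not confirmed by this summary: the norm is taken to be the spectral (operator 2-) norm, and $\Pi_F$ is written as $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, the orthogonal projector onto $\mathbf{1}^\perp$. The function name `lambda_2` and both example matrices are illustrative only.

```python
import numpy as np


def lambda_2(expected_w):
    """Spectral norm of E[W] restricted to the subspace orthogonal to the all-ones vector.

    Assumption: the norm in the theorem is the operator 2-norm.
    """
    n = expected_w.shape[0]
    # Orthogonal projector onto 1^perp: removes the mean component of a vector.
    proj = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.norm(expected_w @ proj, ord=2)


n = 4
full_averaging = np.ones((n, n)) / n   # every node averages with every other node
no_mixing = np.eye(n)                  # nodes never exchange anything

print(lambda_2(full_averaging))
print(lambda_2(no_mixing))
```

The two cases are the extremes: perfect averaging removes all disagreement in a single step, giving a value of 0 up to floating-point error, while no mixing leaves disagreement untouched, giving 1. Smaller values therefore correspond to faster mixing.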

Figures (8)

  • Figure 1: The convergence plots for AD-PSGD and Swift on CIFAR-10, with and without communication stragglers.
  • Figure 2: Model sharing in DL (left) and DivShare (right), from the perspective of a single node. DivShare fragments models and sends each fragment to randomly selected nodes.
  • Figure 3: Timeline of computation and communication operations in DivShare during three local rounds, from the perspective of a node with fast (top) and slow (bottom) communication. Fragments are the same number of bytes.
  • Figure 4: The model utility over time with and without stragglers on CIFAR-10 ($\uparrow$ is better) and MovieLens ($\downarrow$ is better).
  • Figure 5: (a) Accuracy after convergence and (b) time to $60\%$ accuracy on CIFAR-10. A time to accuracy of $\infty$ means that AD-PSGD did not reach the target accuracy.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Remark 1
  • Theorem 1: Convergence of DivShare
  • Remark 2
  • Lemma 2: Ergodic mixing of DivShare