A Tale of Two Learning Algorithms: Multiple Stream Random Walk and Asynchronous Gossip

Peyman Gholami, Hulya Seferoglu

TL;DR

This work compares gossip-based and random-walk-based decentralized learning under varying graph topologies and data heterogeneity. It introduces asynchronous Multi-Walk (MW), a multi-stream random-walk algorithm, and provides a comprehensive convergence analysis for MW and asynchronous gossip across iterations, transmitted bits, and wall-clock time in non-convex settings. Theoretical results show MW outperforms asynchronous gossip on large-diameter graphs, while the advantage can diminish in small-diameter graphs with extreme data heterogeneity; wall-clock analysis reveals gossip benefits from parallelism, whereas MW offers superior communication efficiency in bandwidth-constrained environments. Empirical validation on cycle, complete, and ER graphs with CIFAR-10 and OPT-125M demonstrates the topology-dependent trade-offs, and code is released for reproducibility, offering practical guidelines for decentralized learning deployments.

Abstract

Although gossip and random walk-based learning algorithms are widely known for decentralized learning, there has been limited theoretical and experimental analysis to understand their relative performance for different graph topologies and data heterogeneity. We first design and analyze a random walk-based learning algorithm with multiple streams (walks), which we name asynchronous "Multi-Walk (MW)". We provide a convergence analysis for MW with respect to iterations (computation), wall-clock time, and communication. We also present a convergence analysis for "Asynchronous Gossip", noting the lack of a comprehensive analysis of its convergence, along with the computation and communication overhead, in the literature. Our results show that MW has better convergence in terms of iterations than Asynchronous Gossip in graphs with large diameters (e.g., cycles), while its performance relative to Asynchronous Gossip depends on the number of walks and the data heterogeneity in graphs with small diameters (e.g., complete graphs). In the wall-clock time analysis, we observe a linear speed-up with the number of walks in MW and with the number of nodes in Asynchronous Gossip. Finally, we show that MW outperforms Asynchronous Gossip in communication overhead, except in small-diameter topologies with extreme data heterogeneity. These results highlight the effectiveness of each algorithm in different graph topologies and data heterogeneity. Our code is available for reproducibility.
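To make the idea of several model streams walking over a graph concrete, here is a minimal toy sketch. It is not the paper's Algorithm for MW: it assumes the walks are independent, that the active walk at each iteration is chosen uniformly at random, that each node forwards the model to a uniformly random neighbor, and that node $i$ holds the hypothetical local loss $f_i(x) = \frac{1}{2}(x - c_i)^2$. All function and parameter names are illustrative.

```python
import random


def cycle_neighbors(n):
    """Adjacency lists of an n-node cycle."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def multi_stream_random_walk_sgd(neighbors, targets, num_walks, steps, lr, seed=0):
    """Toy multi-stream random-walk SGD (assumed form, see lead-in).

    Each walk carries its own scalar model. Per iteration, one walk is
    activated, the node currently holding it takes a local gradient step,
    and the model is then handed to a random neighbor.
    """
    rng = random.Random(seed)
    n = len(targets)
    positions = [rng.randrange(n) for _ in range(num_walks)]  # node holding each walk
    models = [0.0] * num_walks
    for _ in range(steps):
        r = rng.randrange(num_walks)           # assumption: uniform choice of active walk
        node = positions[r]
        grad = models[r] - targets[node]       # gradient of 0.5 * (x - c_node)^2
        models[r] -= lr * grad                 # local SGD step at the visited node
        positions[r] = rng.choice(neighbors[node])  # pass the model along one edge
    return models


targets = [float(i) for i in range(20)]        # hypothetical heterogeneous local optima
models = multi_stream_random_walk_sgd(
    cycle_neighbors(20), targets, num_walks=2, steps=5000, lr=0.05
)
print(models)
```

In this toy model, total work is counted in iterations shared across the walks, which mirrors how the abstract measures convergence per iteration while varying the number of walks.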

Paper Structure

This paper contains 37 sections, 9 theorems, 86 equations, 8 figures, 6 tables, 2 algorithms.

Key Result

Theorem 4.1

Multi-Walk (MW). Let assumptions as1-as4 hold, with a constant and small enough learning rate $\eta$ (potentially depending on $T$), after $T$ iterations of Algorithm alg:MW, $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\| \nabla f(\mathbf{x}^{r_t}_{t}) \|^2$ is where $F := f(\mathbf{x}_0)-f^*$, and $H^2$ is the second moment of the first return time to Node $0$ for the Markov
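The quantity $H^2$ can be computed numerically for small graphs. The sketch below assumes the Markov chain is a simple random walk without self-loops (uniform choice among neighbors), which the truncated statement above does not confirm. It propagates the probability mass of walks that have not yet returned to Node $0$ and accumulates $t$ and $t^2$ weighted by the probability of first return at time $t$. The function names are illustrative.

```python
def cycle_graph(n):
    """Adjacency lists of an n-node cycle."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete_graph(n):
    """Adjacency lists of an n-node complete graph (no self-loops)."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def return_time_moments(adj, start=0, tol=1e-15, max_steps=100_000):
    """First and second moments of the first return time to `start`.

    Assumes a simple random walk: each step moves to a uniformly random
    neighbor. The sum is truncated once the not-yet-returned mass is below tol.
    """
    n = len(adj)
    # mass[i] = P(walk is at node i at time t and has not yet returned to start)
    mass = [0.0] * n
    mass[start] = 1.0
    m1 = m2 = 0.0
    for t in range(1, max_steps + 1):
        nxt = [0.0] * n
        for i, p in enumerate(mass):
            if p == 0.0:
                continue
            share = p / len(adj[i])
            for j in adj[i]:
                nxt[j] += share
        returned = nxt[start]      # probability that the first return happens at time t
        m1 += t * returned
        m2 += t * t * returned
        nxt[start] = 0.0           # returned walks leave the surviving mass
        mass = nxt
        if sum(mass) < tol:
            break
    return m1, m2


mean_cycle, h2_cycle = return_time_moments(cycle_graph(20))
mean_complete, h2_complete = return_time_moments(complete_graph(20))
print(mean_cycle, h2_cycle)
print(mean_complete, h2_complete)
```

Under these assumptions both graphs have the same mean return time (the number of nodes), but the 20-node cycle has a larger second moment than the 20-node complete graph, so $H^2$ is sensitive to topology.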

Figures (8)

  • Figure 1: Example instance of MW in a $3$-node network with two walks ($R=2$), where $t$ represents the iteration number.
  • Figure 2: Training loss of ResNet-$20$ on CIFAR-$10$ on a $20$-node graph with different topologies.
  • Figure 3: Training of ResNet-$20$ on CIFAR-$10$ on a $20$-node Erdős–Rényi ($0.3$) graph for different levels of non-IID-ness.
  • Figure 4: Fine-tuning OPT-$125$M on the MultiNLI corpus in a $20$-node Erdős–Rényi ($0.3$) graph.
  • Figure 5: Different levels of non-IID-ness using a Dirichlet distribution with different values of $\alpha$ for CIFAR-$10$.
  • ...and 3 more figures
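Figure 5 refers to Dirichlet-based non-IID splits. The paper's exact partitioning procedure is not shown here, so the following is a sketch of one common recipe, stated as an assumption: for each class, draw node proportions from a symmetric Dirichlet($\alpha$) distribution and split that class's samples accordingly. Smaller $\alpha$ concentrates each class on fewer nodes. The labels below are synthetic stand-ins, and all names are illustrative.

```python
import random


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:               # guard against numerical underflow
        return [1.0 / k] * k
    return [d / total for d in draws]


def dirichlet_partition(labels, num_nodes, alpha, seed=0):
    """Assign sample indices to nodes, class by class, using Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    parts = [[] for _ in range(num_nodes)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet_sample(alpha, num_nodes, rng)
        start, cum = 0, 0.0
        for node in range(num_nodes):
            cum += props[node]
            if node == num_nodes - 1:
                end = len(idxs)    # last node takes the remainder
            else:
                end = min(max(int(round(cum * len(idxs))), start), len(idxs))
            parts[node].extend(idxs[start:end])
            start = end
    return parts


labels = [i % 10 for i in range(1000)]   # synthetic 10-class labels, not CIFAR-10 itself
parts = dirichlet_partition(labels, num_nodes=20, alpha=0.1)
print([len(p) for p in parts])
```

Every sample lands on exactly one node, so the split changes only how labels are distributed across nodes, not the total amount of data.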

Theorems & Definitions (14)

  • Theorem 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Corollary 4.4
  • Lemma 2.1: Descent Lemma for Multi-Walk
  • proof
  • Lemma 2.2: Bounding Deviation for Multi-Walk
  • proof
  • Lemma 2.3: Similar to Lemma 16 in unified-koloskova20a
  • proof
  • ...and 4 more