A Tale of Two Learning Algorithms: Multiple Stream Random Walk and Asynchronous Gossip
Peyman Gholami, Hulya Seferoglu
TL;DR
This work compares gossip-based and random-walk-based decentralized learning under varying graph topologies and data heterogeneity. It introduces asynchronous Multi-Walk (MW), a multi-stream random-walk algorithm, and provides convergence analyses for MW and asynchronous gossip in non-convex settings, measured in iterations, transmitted bits, and wall-clock time. The theory shows that MW outperforms asynchronous gossip on large-diameter graphs, while its advantage can diminish on small-diameter graphs with extreme data heterogeneity. The wall-clock analysis shows that gossip benefits from parallelism, whereas MW is more communication-efficient in bandwidth-constrained environments. Experiments on cycle, complete, and Erdős-Rényi (ER) graphs with CIFAR-10 and OPT-125M illustrate these topology-dependent trade-offs and yield practical guidelines for decentralized learning deployments. Code is released for reproducibility.
Abstract
Although gossip and random walk-based learning algorithms are widely known for decentralized learning, there has been limited theoretical and experimental analysis of their relative performance across graph topologies and levels of data heterogeneity. We first design and analyze a random walk-based learning algorithm with multiple streams (walks), which we name asynchronous "Multi-Walk (MW)". We provide a convergence analysis for MW with respect to iterations (computation), wall-clock time, and communication. We also present a convergence analysis for "Asynchronous Gossip", since the literature lacks a comprehensive analysis of its convergence together with its computation and communication overhead. Our results show that MW converges faster than Asynchronous Gossip in terms of iterations in graphs with large diameters (e.g., cycles), while in graphs with small diameters (e.g., complete graphs) its performance relative to Asynchronous Gossip depends on the number of walks and the data heterogeneity. In the wall-clock time analysis, we observe a linear speed-up with the number of walks for MW and with the number of nodes for Asynchronous Gossip. Finally, we show that MW outperforms Asynchronous Gossip in communication overhead, except in small-diameter topologies with extreme data heterogeneity. These results highlight the effectiveness of each algorithm under different graph topologies and data heterogeneity. Our code is available for reproducibility.
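To make the two families of algorithms concrete, the following is a minimal toy sketch in Python. It is not the paper's MW or Asynchronous Gossip algorithm. It only illustrates the general idea of several models being carried by random walks versus nodes averaging with random neighbors. Everything in it is an assumption made for illustration: the function names (`multi_walk`, `async_gossip`, `cycle_graph`, `complete_graph`), the scalar quadratic local losses, the constant step size, and the rule that one walk or one node acts per step.

```python
import random


def cycle_graph(n):
    """Adjacency list of an n-node cycle (a large-diameter topology)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete_graph(n):
    """Adjacency list of an n-node complete graph (a small-diameter topology)."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def multi_walk(neighbors, targets, num_walks, steps, lr=0.05, seed=0):
    """Toy multi-stream random walk (illustrative, not the paper's MW).

    Node i holds the hypothetical local loss 0.5 * (x - targets[i])**2.
    Each walk carries its own scalar model from node to node.
    """
    rng = random.Random(seed)
    nodes = list(neighbors)
    positions = [rng.choice(nodes) for _ in range(num_walks)]
    models = [0.0] * num_walks
    for _ in range(steps):
        w = rng.randrange(num_walks)  # one walk acts per step (assumption)
        node = positions[w]
        # Local gradient step on the visited node's quadratic loss.
        models[w] -= lr * (models[w] - targets[node])
        # Forward the model to a uniformly random neighbor.
        positions[w] = rng.choice(neighbors[node])
    return sum(models) / num_walks


def async_gossip(neighbors, targets, steps, lr=0.05, seed=0):
    """Toy asynchronous gossip (illustrative): every node keeps a model."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    models = {i: 0.0 for i in nodes}
    for _ in range(steps):
        i = rng.choice(nodes)  # one node wakes up per step (assumption)
        models[i] -= lr * (models[i] - targets[i])
        j = rng.choice(neighbors[i])
        # Pairwise averaging with a random neighbor.
        avg = 0.5 * (models[i] + models[j])
        models[i] = models[j] = avg
    return sum(models.values()) / len(nodes)
```

With heterogeneous targets such as `targets = [float(i) for i in range(8)]`, calling `multi_walk(cycle_graph(8), targets, num_walks=4, steps=2000)` and `async_gossip(cycle_graph(8), targets, steps=2000)` gives a rough feel for how topology and the number of walks interact. The sketch makes no claim to reproduce the convergence rates analyzed in the paper.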
