Towards Practical Overlay Networks for Decentralized Federated Learning
Yifan Hua, Jinlong Pang, Xiaoxue Zhang, Yi Liu, Xiaofeng Shi, Bao Wang, Yang Liu, Chen Qian
TL;DR
FedLay is a fully decentralized overlay network for decentralized federated learning (DFL) that achieves fast model convergence, high accuracy, and low communication cost while remaining resilient to node churn. It constructs near-random regular topologies using multiple virtual ring spaces and greedy routing, so nodes can discover and maintain neighbors without a central server. Its protocol stack has two parts: Neighbor Discovery and Maintenance Protocols (NDMP) for topology upkeep, and a Model Exchange Protocol (MEP) that uses confidence-weighted asynchronous exchanges and model fingerprinting to mitigate low-quality transmissions. Results from real deployments, emulations, and simulations show that FedLay outperforms existing DFL overlays in convergence speed, accuracy, and resilience at modest communication cost, making DFL over a decentralized topology practical.
Abstract
Decentralized federated learning (DFL) uses peer-to-peer communication to avoid the single point of failure in federated learning and has been considered an attractive solution for machine learning tasks on distributed devices. We provide the first solution to a fundamental network problem of DFL: what overlay network should DFL use to achieve fast training of highly accurate models, low communication cost, and decentralized construction and maintenance? Overlay topologies for DFL have been investigated, but no existing DFL topology includes decentralized protocols for network construction and topology maintenance. Without these protocols, DFL cannot run in practice. This work presents an overlay network, called FedLay, which provides fast training and low communication cost for practical DFL. FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining them under node joins and failures. Experiments based on a prototype implementation and simulations show that FedLay achieves the fastest model convergence and highest accuracy on real datasets compared to existing DFL solutions, while incurring small communication costs and being resilient to node joins and failures.
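
To make the idea of a near-random regular topology built from multiple virtual ring spaces more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not FedLay's protocol: it assumes each node draws an independent random coordinate in every ring space and links to its two ring-adjacent nodes in each space, and it computes the resulting graph centrally rather than through decentralized joins and greedy routing. The function name `build_ring_overlay` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def build_ring_overlay(node_ids, num_spaces, seed=0):
    """Return {node: set(neighbors)} for an overlay built from virtual rings.

    Assumption (not taken from the paper's text): in each of `num_spaces`
    ring spaces, every node gets a random coordinate in [0, 1) and is linked
    to the nodes immediately before and after it on that ring.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for _ in range(num_spaces):
        # Independent random coordinates give a fresh random ordering per space.
        coords = {node: rng.random() for node in node_ids}
        ring = sorted(node_ids, key=coords.get)
        for i, node in enumerate(ring):
            successor = ring[(i + 1) % len(ring)]  # wrap around to close the ring
            if successor != node:
                # Links are undirected: record both directions.
                neighbors[node].add(successor)
                neighbors[successor].add(node)
    return dict(neighbors)


if __name__ == "__main__":
    overlay = build_ring_overlay(list(range(20)), num_spaces=3, seed=1)
    for node in sorted(overlay):
        print(node, sorted(overlay[node]))
```

With `num_spaces` rings, every node ends up with at most `2 * num_spaces` neighbors, and fewer only when the same pair happens to be adjacent in more than one space. That is why such a graph is close to regular while still looking random, and a single ring already guarantees the overlay is connected.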
