Asynchronous Multi-Server Federated Learning for Geo-Distributed Clients
Yuncong Zuo, Bart Cox, Lydia Y. Chen, Jérémie Decouchant
TL;DR
This work introduces Spyker, presented as the first fully asynchronous, flat multi-server federated learning framework designed for geo-distributed clients. Spyker uses a token-based mechanism and model aging to coordinate asynchronous inter-server exchanges and client updates, which keeps servers active and mitigates drift across heterogeneous data and resources. It combines local learning-rate decay, which balances contributions from fast and slow clients, with an asynchronous peer-to-peer server aggregation scheme, and reaches similar or higher accuracy than the baselines while reducing convergence time by about 61% in geo-distributed settings. Experiments on MNIST, CIFAR-10, and WikiText-2 indicate that Spyker scales better as the number of clients and servers grows and remains robust to network delays and data heterogeneity, which makes it practical for real-world geo-distributed FL deployments.
Abstract
Federated learning (FL) systems enable multiple clients to train a machine learning model iteratively by synchronously exchanging intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time due to synchronous communication, and the risk of the single server becoming a bottleneck. In this paper, we propose a new FL architecture that is, to our knowledge, the first entirely asynchronous multi-server FL system, and that therefore addresses both limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, which ensures that their updates are integrated into the model efficiently. Unlike in those methods, however, servers also periodically update each other asynchronously and never postpone interactions with clients. We compare our solution to three representative baselines (FedAvg, FedAsync, and HierFAVG) on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy levels than these baselines and requires 61% less time to do so in geo-distributed settings.
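
To make the idea of asynchronous integration more concrete, the following Python sketch shows how a server might merge an incoming model the moment it arrives, without waiting for other clients or servers. It is not Spyker's actual algorithm, which this summary does not specify. The mixing rule is the staleness-discounted average used by FedAsync-style methods, and the names `Server`, `merge`, `staleness_weight`, `base_alpha`, and `a` are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import List


def staleness_weight(base_alpha: float, staleness: int, a: float = 0.5) -> float:
    """Polynomial staleness discount: base_alpha * (staleness + 1) ** -a.

    Assumption: this discount is borrowed from FedAsync-style methods and is
    used here only for illustration; it is not taken from the Spyker paper.
    """
    return base_alpha * (staleness + 1) ** (-a)


@dataclass
class Server:
    """Hypothetical server holding a flat list of model weights."""

    weights: List[float]
    version: int = 0  # number of updates merged so far

    def merge(
        self,
        incoming: List[float],
        incoming_version: int,
        base_alpha: float = 0.5,
    ) -> None:
        """Merge an update immediately, discounting it by its staleness.

        `incoming_version` is the server version the sender started from.
        The same call could serve a client update or a peer server's model.
        """
        staleness = max(0, self.version - incoming_version)
        alpha = staleness_weight(base_alpha, staleness)
        # Convex combination of the current model and the incoming model.
        self.weights = [
            (1 - alpha) * w + alpha * v
            for w, v in zip(self.weights, incoming)
        ]
        self.version += 1
```

For example, a server created with `Server([0.0, 0.0])` that merges `[1.0, 1.0]` from a client that started at version 0 applies the full `base_alpha`. A second, slower client that also started at version 0 is merged with a smaller weight, because the server has advanced by one version in the meantime. Neither merge blocks on any other participant, which is the property the abstract emphasises.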
