Asynchronous Multi-Server Federated Learning for Geo-Distributed Clients

Yuncong Zuo, Bart Cox, Lydia Y. Chen, Jérémie Decouchant

TL;DR

This work introduces Spyker, which the authors describe as the first fully asynchronous flat multi-server federated learning framework designed for geo-distributed clients. Spyker uses a token-based mechanism and model aging to coordinate asynchronous inter-server exchanges and client updates, keeping servers active and mitigating drift across heterogeneous data and resources. It combines local learning-rate decay, which balances contributions from fast and slow clients, with an asynchronous peer-to-peer server aggregation scheme, reaching similar or higher accuracy than the baselines while reducing convergence time by about 61% in geo-distributed settings. Experiments on MNIST, CIFAR-10, and WikiText-2 show that Spyker scales better to more clients and servers and is robust to network delays and data heterogeneity, supporting its use in real-world geo-distributed FL deployments.

Abstract

Federated learning (FL) systems enable multiple clients to train a machine learning model iteratively by synchronously exchanging intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time due to synchronous communication and the risk of a single server becoming the bottleneck. In this paper, we propose a new FL architecture that is, to our knowledge, the first multi-server FL system that is entirely asynchronous, and that therefore addresses these two limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, ensuring efficient update integration into the model. Unlike in previous methods, however, servers also periodically update each other asynchronously and never postpone interactions with clients. We compare our solution to three representative baselines (FedAvg, FedAsync, and HierFAVG) on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy levels than these baselines and requires 61% less time to do so in geo-distributed settings.
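
To make the idea of a server that never waits for a full round more concrete, here is a minimal Python sketch of asynchronous update integration. It is not Spyker's actual aggregation rule, which this page does not specify. It assumes a staleness-discounted mixing rule in the style of the FedAsync baseline, and the class name `AsyncServer`, the parameters `alpha` and `decay`, and their default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AsyncServer:
    """Hypothetical server that folds in each client update as it arrives."""

    weights: list[float]
    alpha: float = 0.5  # base mixing rate (assumed value)
    decay: float = 0.5  # staleness exponent (assumed value)
    version: int = 0    # number of updates integrated so far

    def mixing_rate(self, base_version: int) -> float:
        # Staleness = how many updates the server integrated since the
        # client downloaded the model it trained on.
        staleness = max(0, self.version - base_version)
        # Polynomial discount: older updates get a smaller weight.
        return self.alpha * (1 + staleness) ** (-self.decay)

    def integrate(self, new_weights: list[float], base_version: int) -> float:
        # No barrier: the update is applied immediately, so the server
        # never idles while waiting for slower clients.
        a = self.mixing_rate(base_version)
        self.weights = [
            (1 - a) * w + a * n for w, n in zip(self.weights, new_weights)
        ]
        self.version += 1
        return a
```

In this sketch, a client that trained on the current model version is mixed in with rate `alpha`, while a client that trained on an older version receives a smaller rate, so slow clients still contribute without dominating the model.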

Paper Structure

This paper contains 22 sections, 4 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: A training round of a synchronous FL system with heterogeneous devices and networks. The times at which the server receives client updates are indicated with colored spheres. The server waits for all client updates to be received to update the model, which results in a mostly idle server.
  • Figure 2: Architectures of FL systems
  • Figure 3: WikiText-2: Perplexity w.r.t. time (lower is better)
  • Figure 4: WikiText-2: Perplexity w.r.t. number of updates (lower is better)
  • Figure 5: MNIST: Accuracy w.r.t. time (higher is better)
  • ...and 7 more figures