Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency

Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere, Éric Moulines

TL;DR

The paper addresses the straggler bottleneck of synchronous federated learning by modeling and analyzing asynchronous FL with queueing theory. Using Jackson networks, it derives a discrete Little's-law-like expression for the mean gradient staleness, which makes a bound (G) on the average loss per model update amenable to gradient-based optimization of the routing probabilities while accounting for heterogeneous client speeds and datasets. To balance training speed against gradient accuracy, it introduces a wall-clock-time metric with a tractable upper bound (H), and shows in simulations on real datasets that tuning routing and concurrency improves accuracy by 10% to 30%. The framework yields practical guidelines for routing in asynchronous FL and can extend to related systems such as FedBuff, highlighting the role of queueing dynamics in distributed learning performance.

Abstract

Synchronous federated learning (FL) scales poorly with the number of clients due to the straggler effect. Algorithms like FedAsync and GeneralizedFedAsync address this limitation by enabling asynchronous communication between clients and the central server. In this work, we rely on stochastic modeling and analysis to better understand the impact of design choices in asynchronous FL algorithms, such as the concurrency level and routing probabilities, and we leverage this knowledge to optimize loss. Compared to most existing studies, we account for the joint impact of heterogeneous and variable service speeds and heterogeneous datasets at the clients. In particular, we characterize a fundamental trade-off for optimizing asynchronous FL: minimizing gradient estimation errors by avoiding model parameter staleness, while also speeding up the system by increasing the throughput of model updates. Our two main contributions can be summarized as follows. First, we prove a discrete variant of Little's law to derive a closed-form expression for relative delay, a metric that quantifies staleness. This allows us to efficiently minimize the average loss per model update, which has been the gold standard in the literature to date, using the upper bound of Leconte et al. as a proxy. Second, we observe that naively optimizing this metric drastically slows down the system by overemphasizing staleness at the expense of throughput. This motivates us to introduce an alternative metric that also accounts for speed, for which we derive a tractable upper bound that can be minimized numerically. Extensive numerical results show these optimizations enhance accuracy by 10% to 30%.

Paper Structure

This paper contains 47 sections, 13 theorems, 58 equations, 11 figures, 2 algorithms.

Key Result

Proposition 1.1

In the framework of \ref{sec:fl}, the sequence $(X_t, t \in \mathbb{N})$ defines an irreducible positive recurrent Markov chain with a stationary distribution whose normalizing constant $Z_{n, m-1}$ can be computed by applying Buzen's recursive algorithm.
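The proposition's displayed formulas are not reproduced on this page, so the following is only a minimal sketch of the textbook form of Buzen's convolution algorithm for a closed Jackson network with single-server stations. It assumes the usual product-form weights $\prod_i \rho_i^{x_i}$ with per-client load $\rho_i = p_i / \mu_i$ (routing probability over service speed); the paper's exact expression and notation may differ. The function names `buzen_normalizing_constants` and `brute_force_constant` and the example values are hypothetical.

```python
from itertools import product
from math import isclose, prod


def buzen_normalizing_constants(loads, num_tasks):
    """Return [Z_0, ..., Z_num_tasks] for a closed product-form network.

    loads[i] is the assumed per-station load rho_i = p_i / mu_i.
    Z_k is the sum over all states x with x_1 + ... + x_n = k of
    prod_i rho_i ** x_i, computed with Buzen's recursion
    Z(i, k) = Z(i - 1, k) + rho_i * Z(i, k - 1).
    """
    z = [1.0] + [0.0] * num_tasks  # network with zero stations
    for load in loads:  # add one station at a time
        for k in range(1, num_tasks + 1):
            z[k] += load * z[k - 1]  # in-place update of the recursion
    return z


def brute_force_constant(loads, num_tasks):
    """Same constant by enumerating every state (only for small checks)."""
    total = 0.0
    for state in product(range(num_tasks + 1), repeat=len(loads)):
        if sum(state) == num_tasks:
            total += prod(rho ** x for rho, x in zip(loads, state))
    return total


if __name__ == "__main__":
    # Hypothetical toy instance: 3 clients, uniform routing, unequal speeds.
    p = [1 / 3, 1 / 3, 1 / 3]
    mu = [1.0, 2.0, 4.0]
    loads = [pi / mi for pi, mi in zip(p, mu)]
    z = buzen_normalizing_constants(loads, num_tasks=5)
    assert isclose(z[5], brute_force_constant(loads, 5))
    print("Z for 0..5 tasks:", z)
```

The recursion costs on the order of $n \times m$ operations, whereas direct enumeration grows combinatorially in the number of states, which is why a recursive scheme is attractive once the normalizing constant has to be re-evaluated inside an optimization loop over the routing probabilities.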

Figures (11)

  • Figure 1: Third term of the bound $G$ given in \ref{eq:G} vs. the routing probability to the slowest client, in a toy example with $n = 2$ clients and $m = 20$ tasks, for various speed vectors $\mu = (\mu_s, \mu_f)$.
  • Figure 2: Performance on the test set in the scenario of \ref{num:optimize-updates}, with $n = 20$ clients and $m = 100$ tasks, under homogeneous and heterogeneous data distributions. Solid lines show averages over independent runs; shaded areas denote standard deviations. For Fashion-MNIST, we ran 10 simulations of 3,000 rounds, recording accuracy and loss every 5 rounds. For CIFAR-10 and CIFAR-100, we applied standard normalization and data augmentation, and ran 3 simulations of 30,000 rounds, logging performance every 50 rounds. All routing strategies use the same model initialization.
  • Figure 3: Bound $H(p^{\text{uniform}})$ as a function of the number $m$ of tasks for different values of the step size $\eta$. The system consists of 50 clients, with speeds given by $\mu_i=\exp(i/100)$ for each $i \in \{1,\ldots,n\}$.
  • Figure 4: Performance on the test set with respect to wall-clock time in the scenario of \ref{num:optimize-time}, with $n = 30$ clients and $m = 30$ tasks on homogeneous and heterogeneous KMNIST datasets. Simulations ran for 3,000 wall-clock time units and were repeated 10 times. Solid lines show means; shaded areas indicate standard deviations.
  • Figure 5: Performance on the test set in the scenario of \ref{num:optimize-updates}, with $n = 20$ clients and $m = 100$ tasks under highly heterogeneous data splits. For Fashion-MNIST, we simulated training over 3,000 rounds, repeated 10 times, recording accuracy and loss every 5 rounds. For CIFAR-10 and CIFAR-100, we applied standard normalization and data augmentation, ran each simulation for 30,000 rounds, and repeated it 3 times, logging metrics every 50 rounds on an unseen test set. Solid lines show metrics averaged over independent runs; shaded areas represent standard deviations.
  • ...and 6 more figures

Theorems & Definitions (23)

  • Proposition 1.1
  • proof
  • Theorem 2.1
  • proof
  • Proposition 3.1
  • proof
  • Proposition 3.2: Time to achieve an $\epsilon$-accuracy
  • proof
  • Lemma D.1
  • proof
  • ...and 13 more