
Queuing dynamics of asynchronous Federated Learning

Louis Leconte, Matthieu Jonckheere, Sergey Samsonov, Eric Moulines

TL;DR

This work addresses the inefficiencies of asynchronous federated learning under heterogeneous client speeds by modeling the system's queuing dynamics with a closed Jackson network. It introduces Generalized AsyncSGD, a non-uniform sampling strategy that leverages queuing insights to reduce server delays while preserving unbiased gradients, and provides convergence bounds in the non-convex setting. Theoretical analysis connects queueing statistics to optimization progress and offers asymptotic insights under heavy load, while extensive image-classification experiments (CIFAR-10, TinyImageNet) demonstrate substantial empirical gains over state-of-the-art asynchronous baselines. The approach enables more scalable, efficient FL in real-world networks where server and client speeds vary widely, with a transparent link between network dynamics and learning performance.
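One standard way a non-uniform sampling scheme can keep gradient estimates unbiased is importance weighting: if client $i$ is selected with probability $p_i$, scaling its gradient by $1/(n p_i)$ makes the expectation equal to the uniform average over the $n$ clients. The sketch below illustrates that identity on scalar toy gradients. The $1/(n p_i)$ scaling, the function names, and the toy values are assumptions made for illustration and are not taken from the text above.

```python
import random


def weighted_gradient(grads, probs, rng):
    """Sample one client i with probability probs[i]; return grads[i] / (n * probs[i])."""
    n = len(grads)
    i = rng.choices(range(n), weights=probs, k=1)[0]
    return grads[i] / (n * probs[i])


def exact_expectation(grads, probs):
    """Expectation of weighted_gradient, computed in closed form."""
    n = len(grads)
    return sum(p * (g / (n * p)) for g, p in zip(grads, probs))


# Hypothetical scalar "gradients" and a non-uniform sampling distribution.
grads = [1.0, -2.0, 4.0, 0.5]
probs = [0.1, 0.2, 0.3, 0.4]

uniform_mean = sum(grads) / len(grads)

# Monte Carlo check: the empirical mean should approach the uniform average.
rng = random.Random(0)
num_samples = 200_000
empirical_mean = sum(weighted_gradient(grads, probs, rng) for _ in range(num_samples)) / num_samples

print("uniform average:   ", uniform_mean)
print("exact expectation: ", exact_expectation(grads, probs))
print("Monte Carlo mean:  ", empirical_mean)
```

The reweighting removes bias but can increase variance when some $p_i$ are small, which is one reason the choice of $\mathbf{p}$ matters.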

Abstract

We study asynchronous federated learning mechanisms with nodes having potentially different computational speeds. In such an environment, each node is allowed to work on models with potential delays and contribute to updates to the central server at its own pace. Existing analyses of such algorithms typically depend on intractable quantities such as the maximum node delay and do not consider the underlying queuing dynamics of the system. In this paper, we propose a non-uniform sampling scheme for the central server that allows for lower delays with better complexity, taking into account the closed Jackson network structure of the associated computational graph. Our experiments clearly show a significant improvement of our method over current state-of-the-art asynchronous algorithms on an image classification problem.
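To make the queuing picture concrete, here is a minimal simulation sketch of a closed network in which a fixed number of tasks circulate between the server and the nodes. It rests on several assumptions that are not stated in the abstract: each node serves its tasks in FIFO order with exponentially distributed service times, the server takes one step per completed task and immediately dispatches a new task to a node drawn from the sampling distribution, and a task's delay is counted as the number of server steps between its dispatch and its completion. The function `simulate_delays` and all parameter values are hypothetical.

```python
import heapq
import random
from collections import deque


def simulate_delays(rates, probs, concurrency, steps, seed=0):
    """Simulate a closed queueing network; return per-node lists of task delays.

    rates[i]    : service rate of node i (assumed exponential service times)
    probs[i]    : probability that the server routes a new task to node i
    concurrency : number of tasks circulating in the network at all times
    steps       : number of server steps (task completions) to simulate
    """
    rng = random.Random(seed)
    n = len(rates)
    queues = [deque() for _ in range(n)]  # each entry: server step at dispatch time
    events = []                           # heap of (completion_time, node)

    def dispatch(now, step):
        j = rng.choices(range(n), weights=probs, k=1)[0]
        queues[j].append(step)
        if len(queues[j]) == 1:           # node was idle, so it starts serving now
            heapq.heappush(events, (now + rng.expovariate(rates[j]), j))

    for _ in range(concurrency):
        dispatch(0.0, 0)

    delays = [[] for _ in range(n)]
    for step in range(1, steps + 1):
        now, i = heapq.heappop(events)    # next task to finish, at node i
        sent_at = queues[i].popleft()
        delays[i].append(step - sent_at)  # delay measured in server steps
        if queues[i]:                     # node i moves on to its next queued task
            heapq.heappush(events, (now + rng.expovariate(rates[i]), i))
        dispatch(now, step)               # server routes a fresh task
    return delays


# Hypothetical setup: 5 fast nodes and 5 slow nodes, uniform sampling.
rates = [10.0] * 5 + [1.0] * 5
uniform = [1.0 / len(rates)] * len(rates)
delays = simulate_delays(rates, uniform, concurrency=10, steps=20_000)

fast = [d for node in delays[:5] for d in node]
slow = [d for node in delays[5:] for d in node]
print("mean delay, fast nodes:", sum(fast) / len(fast))
print("mean delay, slow nodes:", sum(slow) / len(slow))
```

Replacing `uniform` with a distribution that routes fewer tasks to the slow nodes lets one compare delay profiles across sampling schemes under the same assumptions.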

Paper Structure (35 sections, 11 theorems, 85 equations, 12 figures, 2 tables, 1 algorithm)

This paper contains 35 sections, 11 theorems, 85 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

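Suppose that assumptions assum:uniflowerbound through assum:graddissim hold, and let the learning rate $\eta$ satisfy $\eta \leq \eta_{\max}(\mathbf{p})$. Then Generalized AsyncSGD converges at a rate that involves the constant $B = 2 G^2 + \sigma^2$. The explicit expressions for $\eta_{\max}(\mathbf{p})$ and for the rate are given in the paper.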

Figures (12)

  • Figure 1: Evolution of $\m$ w.r.t. $k$, for two networks of size $n=10, 50$ initialized with full concurrency.
  • Figure 2: Optimal sampling probability $p$ as a function of the speed for different concurrency levels. The number of nodes is fixed to $n=100$ nodes.
  • Figure 3: Relative improvements of the upper bounds as a function of the speed for different concurrency levels. The number of nodes is fixed to $n=100$ nodes.
  • Figure 4: Relative improvement of Generalized AsyncSGD over FedBuff and AsyncSGD as a function of speed. The number of nodes is fixed to $n=100$ nodes.
  • Figure 5: Histogram of fast and slow delays (in number of server steps) for a uniform sampling scheme.
  • ...and 7 more figures

Theorems & Definitions (16)

  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Proposition 3: Corollary.2 in van2021scaling
  • Proposition 4
  • Lemma 1
  • proof
  • Remark 1
  • Lemma 2
  • proof
  • ...and 6 more