Asynchronous Federated Optimization

Cong Xie, Sanmi Koyejo, Indranil Gupta

TL;DR

This work tackles scalability and straggler issues in federated learning by introducing FedAsync, an asynchronous optimization framework that solves regularized local objectives and updates the global model via adaptive mixing to mitigate staleness. The authors prove convergence for a restricted class of non-convex problems under standard smoothness and delay assumptions, and demonstrate through experiments on CIFAR-10 and WikiText-2 that FedAsync achieves fast convergence and robustness to stale updates, often outperforming synchronous FedAvg. The combination of a regularized local objective, adaptive mixing, and asynchronous server–worker communication offers practical gains for large-scale, non-IID federated setups, with future work focusing on refining adaptive strategies. These insights are relevant for developers and researchers seeking scalable, robust federated optimization algorithms capable of handling heterogeneous devices and network conditions.

Abstract

Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly convex and a restricted family of non-convex problems. Empirical results show that the proposed algorithm converges quickly and tolerates staleness in various applications.

Paper Structure

This paper contains 13 sections, 2 theorems, 12 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 5

Assume that $F$ is $L$-smooth and $\mu$-weakly convex, and each worker executes at least $H_{min}$ and at most $H_{max}$ local updates before pushing models to the server. We assume bounded delay $t-\tau \leq K$. The imbalance ratio of local updates is $\delta = \frac{H_{max}}{H_{min}}$. Furthermore

Figures (4)

  • Figure 1: System overview. 0: scheduler triggers training through coordinator. 1, 2: worker receives model $x_{t-\tau}$ from server via coordinator. 3: worker computes local updates as in Algorithm 1. Worker can switch between the two states: working and idle. 4, 5, 6: worker pushes the locally updated model to server via the coordinator. Coordinator queues the models received in 5, and feeds them to the updater sequentially in 6. 7, 8: server updates the global model and makes it ready to read in the coordinator. In our system, 1 and 5 operate asynchronously in parallel.
  • Figure 2: Top-1 accuracy (the higher the better) vs. # of gradients on CNN and CIFAR-10 dataset. The maximum staleness is $4$ or $16$. $\gamma = 0.1$, $\rho = 0.005$. For FedAsync+Poly, we take $a=0.5$. For FedAsync+Hinge, we take $a=10, b=4$. Note that when the maximum staleness is $4$, FedAsync+Const and FedAsync+Hinge with $b=4$ are the same.
  • Figure 3: Perplexity (the lower the better) vs. # of gradients on LSTM-based language model and WikiText-2 dataset. The maximum staleness is $4$ or $16$. $\gamma = 20$, $\rho = 0.0001$. For FedAsync+Poly, we take $a=0.5$. For FedAsync+Hinge, we take $a=10, b=2$.
  • Figure 4: Top-1 accuracy on CNN and CIFAR-10 dataset at the end of training, with different staleness. $\gamma = 0.1$, $\rho = 0.01$. $\alpha$ has initial value $0.9$.
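
The captions above name three mixing variants (FedAsync+Const, FedAsync+Poly, FedAsync+Hinge) and the hyperparameters $\gamma$, $\rho$, $\alpha$, $a$, $b$, but this summary does not give their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes that $\gamma$ is the local learning rate, that $\rho$ weights a proximal term pulling the worker toward the model it received, and that the server mixes with weight $\alpha \cdot s(t-\tau)$, where $s$ is a staleness function. The polynomial and hinge forms are written as commonly stated for FedAsync and should be checked against the paper. All function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def s_const(staleness: int) -> float:
    """Constant variant: staleness does not change the mixing weight."""
    return 1.0


def s_poly(staleness: int, a: float = 0.5) -> float:
    """Polynomial variant (assumed form): (staleness + 1) ** (-a)."""
    return (staleness + 1) ** (-a)


def s_hinge(staleness: int, a: float = 10.0, b: int = 4) -> float:
    """Hinge variant (assumed form): no penalty up to b, then 1 / (a * (staleness - b) + 1)."""
    if staleness <= b:
        return 1.0
    return 1.0 / (a * (staleness - b) + 1.0)


def local_step(x: Vector, x_received: Vector, grad: Vector,
               gamma: float, rho: float) -> Vector:
    """One worker step on the assumed regularized objective
    f(x) + (rho / 2) * ||x - x_received||^2, given grad = gradient of f at x."""
    return [xi - gamma * (gi + rho * (xi - xr))
            for xi, xr, gi in zip(x, x_received, grad)]


def server_mix(x_global: Vector, x_new: Vector, alpha: float,
               staleness: int, s: Callable[[int], float]) -> Vector:
    """Server update: weighted average with staleness-adjusted weight alpha * s(staleness)."""
    alpha_t = alpha * s(staleness)
    return [(1.0 - alpha_t) * xg + alpha_t * xn
            for xg, xn in zip(x_global, x_new)]


if __name__ == "__main__":
    # A worker model that is 3 steps stale is mixed in with a reduced weight.
    print(server_mix([0.0, 0.0], [1.0, 2.0], alpha=0.9, staleness=3, s=s_poly))
```

Under these assumed forms, the hinge function with $b=4$ returns 1 for any staleness up to 4, which is consistent with the Figure 2 note that FedAsync+Const and FedAsync+Hinge coincide when the maximum staleness is 4.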

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Definition 3
  • Definition 4
  • Theorem 5
  • Theorem 1