On the Convergence of FedAvg on Non-IID Data

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang

TL;DR

FedAvg enables distributed training across many devices but faces challenges from slow communication, stragglers, and non-IID data. This paper provides the first comprehensive convergence guarantees for FedAvg under non-IID data and partial device participation, showing an $\mathcal{O}(1/T)$ rate for strongly convex and smooth objectives, with explicit dependence on the number of local steps $E$, the participation level $K$, and the data heterogeneity. It shows that data heterogeneity slows convergence and that learning-rate decay is necessary even when full gradients are used; it also analyzes sampling and averaging schemes that mitigate these effects and validates the results with numerical experiments. The work informs design choices for practical federated optimization, highlighting how to balance communication efficiency against convergence in heterogeneous environments.

Abstract

Federated learning enables a large number of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (FedAvg) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of FedAvg on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD steps. Importantly, our bound demonstrates a trade-off between communication efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; a low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for FedAvg on non-iid data: the learning rate $\eta$ must decay, even if the full gradient is used; otherwise, the solution will be $\Omega(\eta)$ away from the optimal one.
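
The procedure the abstract describes (local SGD on a subset of devices, with occasional averaging) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the quadratic local objectives $F_k(w) = \frac{1}{2}\|w - c_k\|^2$, the Gaussian gradient noise, uniform sampling without replacement, the plain average on the server, and the names `fedavg`, `centers`, `num_active` and `lr_schedule` are all assumptions made for this example. It does not attempt to reproduce the paper's Scheme I or Scheme II.

```python
import numpy as np

def fedavg(centers, rounds, local_steps, num_active, lr_schedule,
           noise_std=0.1, seed=0):
    """Toy FedAvg on F_k(w) = 0.5 * ||w - c_k||^2 with equal device weights.

    All modelling choices here are assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    num_devices, dim = centers.shape
    w_global = np.zeros(dim)
    t = 0  # total number of local SGD steps taken so far
    for _ in range(rounds):
        # Partial participation: only num_active devices respond this round.
        active = rng.choice(num_devices, size=num_active, replace=False)
        local_models = []
        for k in active:
            w = w_global.copy()  # each device starts from the global model
            for step in range(local_steps):
                # Stochastic gradient of F_k: exact gradient plus Gaussian noise.
                grad = (w - centers[k]) + noise_std * rng.standard_normal(dim)
                w -= lr_schedule(t + step) * grad
            local_models.append(w)
        # Communication step: the server averages the returned local models.
        w_global = np.mean(local_models, axis=0)
        t += local_steps
    return w_global

# Ten devices with different local minimizers c_k (a simple form of non-iid data).
centers = np.random.default_rng(1).standard_normal((10, 2))
w_partial = fedavg(centers, rounds=2000, local_steps=5, num_active=3,
                   lr_schedule=lambda t: 2.0 / (8 + t))
# Distance to the minimizer of the average objective, which is the mean of c_k.
print(np.linalg.norm(w_partial - centers.mean(axis=0)))
```

With equal weights, the global objective is minimized at the mean of the $c_k$, so the printed distance gives a rough sense of how close the partially participating run gets under a decaying learning rate.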

Paper Structure

This paper contains 59 sections, 11 theorems, 73 equations, 4 figures, 2 tables.

Key Result

Theorem 1

Let Assumptions 1 to 4 hold and $L, \mu, \sigma_k, G$ be defined therein. Choose $\kappa = \frac{L}{\mu}$, $\gamma = \max\{8\kappa, E\}$ and the learning rate $\eta_t = \frac{2}{\mu (\gamma+t)}$. Then FedAvg with full device participation satisfies [...] where [...]
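
The step-size rule in Theorem 1 can be written out directly. The sketch below computes $\kappa$, $\gamma$ and $\eta_t$ exactly as stated in the theorem; the function name `theorem1_learning_rate` and the example constants are hypothetical, and in practice $L$ and $\mu$ would have to be known or estimated for the problem at hand.

```python
def theorem1_learning_rate(t, L, mu, local_steps):
    """Return eta_t = 2 / (mu * (gamma + t)) with gamma = max(8 * kappa, E).

    L is the smoothness constant, mu the strong-convexity constant and
    local_steps the number of local steps E between two averaging rounds.
    """
    kappa = L / mu                       # condition number
    gamma = max(8 * kappa, local_steps)  # offset that keeps early steps small
    return 2.0 / (mu * (gamma + t))

# Hypothetical constants: L = 4, mu = 2 (kappa = 2) and E = 20 local steps,
# so gamma = max(16, 20) = 20.
for t in (0, 20, 180):
    print(t, theorem1_learning_rate(t, L=4.0, mu=2.0, local_steps=20))
```

Because $\gamma \ge E$, a larger number of local steps forces a smaller initial learning rate, and the rate decays like $1/t$ afterwards.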

Figures (4)

  • Figure 1: (a) To reach $\epsilon$ accuracy, the number of required rounds first decreases and then increases as the number of local steps $E$ grows. (b) On the Synthetic(0,0) dataset, decreasing the number of active devices per round has little effect on convergence. (c) On the MNIST balanced dataset, Scheme I slightly outperforms Scheme II, and both perform better than the original scheme. Here transformed Scheme II coincides with Scheme II because the data are balanced. (d) On the MNIST unbalanced dataset, Scheme I performs better than Scheme II and the original scheme. Scheme II suffers from instability, while transformed Scheme II has a lower convergence rate.
  • Figure 2: The left figure shows that the global objective value that FedAvg converges to is not optimal unless $E=1$. Once the learning rate is decayed, FedAvg can converge to the optimum even if $E > 1$.
  • Figure 3: The impact of $K$ on four datasets. To show the differences between the curves more clearly, the last few rounds are magnified in the upper left corner of the box.
  • Figure 4: The performance of four schemes on two synthetic datasets. Scheme I is stable and performs best; the original scheme comes second. The curve of Scheme II fluctuates and shows no sign of convergence. Transformed Scheme II has a lower convergence rate than Scheme I.
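
The behaviour described in the Figure 2 caption can be reproduced on a small toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses two devices with scalar objectives $F_k(w) = \frac{1}{2} a_k (w - c_k)^2$, exact (full) local gradients, full participation and equal weights, and the names `fedavg_full_gradient` and `global_minimizer` are hypothetical. The decaying schedule uses $\mu = 1$ and $L = 4$ for these two objectives, so $\gamma = \max\{8\kappa, E\} = 32$.

```python
import numpy as np

def global_minimizer(curvatures, centers):
    """Minimizer of the equally weighted average of 0.5 * a_k * (w - c_k)^2."""
    a = np.asarray(curvatures, dtype=float)
    c = np.asarray(centers, dtype=float)
    return float((a * c).sum() / a.sum())

def fedavg_full_gradient(curvatures, centers, rounds, local_steps, lr_schedule):
    """FedAvg with full participation and exact local gradients (toy setting)."""
    a = np.asarray(curvatures, dtype=float)
    c = np.asarray(centers, dtype=float)
    w_global = 0.0
    t = 0  # total number of local steps taken so far
    for _ in range(rounds):
        # Every device starts the round from the current global model.
        w_local = np.full_like(a, w_global)
        for _ in range(local_steps):
            # Exact gradient of F_k is a_k * (w - c_k); no sampling noise.
            w_local -= lr_schedule(t) * a * (w_local - c)
            t += 1
        w_global = float(w_local.mean())  # server averages the local models
    return w_global

a, c = [1.0, 4.0], [-1.0, 1.0]          # different curvatures and minimizers
w_star = global_minimizer(a, c)

fixed_e1 = fedavg_full_gradient(a, c, 2000, 1, lambda t: 0.1)
fixed_e5 = fedavg_full_gradient(a, c, 2000, 5, lambda t: 0.1)
decay_e5 = fedavg_full_gradient(a, c, 2000, 5, lambda t: 2.0 / (32 + t))

print("gap, fixed rate, E=1:", abs(fixed_e1 - w_star))
print("gap, fixed rate, E=5:", abs(fixed_e5 - w_star))
print("gap, decayed rate, E=5:", abs(decay_e5 - w_star))
```

In this toy setting the fixed-rate run with $E = 5$ settles at a point that differs from the global minimizer, while the run with $E = 1$ and the run with a decaying rate both approach it, which is consistent with the caption.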

Theorems & Definitions (20)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 1: Results of one step SGD
  • Lemma 2: Bounding the variance
  • Lemma 3: Bounding the divergence of $\{{\bf w}_t^k\}$
  • Proof
  • Proof of Lemma 1
  • Proof of Lemma 2
  • ...and 10 more