On the Convergence of FedAvg on Non-IID Data
Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang
TL;DR
FedAvg enables distributed training across many devices but must contend with slow communication, stragglers, and non-IID data. This paper provides the first comprehensive convergence guarantees for FedAvg under non-IID data and partial device participation, showing an $\mathcal{O}(1/T)$ rate for strongly convex and smooth objectives, with explicit dependence on the local steps $E$, the participation level $K$, and the data heterogeneity. It shows that data heterogeneity slows convergence and that learning-rate decay is necessary even when full gradients are used; it also analyzes sampling and averaging schemes that mitigate these effects and validates the results with numerical experiments. The work informs design choices for practical federated optimization, highlighting how to balance communication efficiency and convergence in heterogeneous environments.
Abstract
Federated learning enables a large number of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGD steps. Importantly, our bound demonstrates a trade-off between communication efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; a low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for \texttt{FedAvg} on non-iid data: the learning rate $\eta$ must decay, even if the full gradient is used; otherwise, the solution will be $\Omega(\eta)$ away from the optimal.
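The last claim of the abstract can be illustrated with a minimal sketch. It is a toy problem and not the paper's own construction. The function `fedavg`, the curvatures `a`, the local optima `c`, and the schedule $1/(10+t)$ are all hypothetical choices made for illustration. Two devices hold scalar quadratics $F_k(w) = \frac{1}{2} a_k (w - c_k)^2$ with different curvatures (a simple form of non-iid data), every device participates, and local updates use the full gradient.

```python
def fedavg(a, c, lr_schedule, rounds, local_steps, w0=0.0):
    """Toy FedAvg: full participation, full local gradients, equal weights.

    Device k holds F_k(w) = 0.5 * a[k] * (w - c[k])**2 (an assumed toy model).
    lr_schedule(t) returns the step size at global SGD step t.
    """
    w, step = w0, 0
    for _ in range(rounds):
        local_models = []
        for a_k, c_k in zip(a, c):
            w_k = w  # each device starts from the current global model
            for e in range(local_steps):
                eta = lr_schedule(step + e)
                w_k -= eta * a_k * (w_k - c_k)  # full-gradient local step
            local_models.append(w_k)
        step += local_steps
        w = sum(local_models) / len(local_models)  # server averaging
    return w


# Hypothetical heterogeneous problem: different curvatures and local optima.
a, c = [1.0, 3.0], [0.0, 4.0]
# Minimizer of the average objective: sum(a_k * c_k) / sum(a_k).
w_star = sum(ak * ck for ak, ck in zip(a, c)) / sum(a)

# Constant learning rate with E = 5 local steps: a gap to w_star remains.
gap_fixed = abs(fedavg(a, c, lambda t: 0.1, rounds=200, local_steps=5) - w_star)
gap_half = abs(fedavg(a, c, lambda t: 0.05, rounds=200, local_steps=5) - w_star)

# Constant learning rate with E = 1 reduces to gradient descent on the average.
gap_e1 = abs(fedavg(a, c, lambda t: 0.1, rounds=200, local_steps=1) - w_star)

# Decaying learning rate with E = 5: the gap shrinks as training proceeds.
gap_decay = abs(
    fedavg(a, c, lambda t: 1.0 / (10 + t), rounds=2000, local_steps=5) - w_star
)

print(gap_fixed, gap_half, gap_e1, gap_decay)
```

With these toy values, the constant-rate gap is about 0.32 at $\eta = 0.1$ and about 0.16 at $\eta = 0.05$, so it shrinks roughly in proportion to $\eta$. With $E = 1$ or with the decaying schedule, the gap goes toward zero.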
