A New Theoretical Perspective on Data Heterogeneity in Federated Optimization
Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji
TL;DR
The paper addresses the mismatch between theory and practice in federated optimization by introducing the heterogeneity-driven pseudo-Lipschitz constant $L_h$ and the global Lipschitz gradient constant $L_g$, which rest on weaker assumptions and are more informative than the local Lipschitz constant. It derives convergence bounds for FedAvg and its extensions under nonconvex objectives, showing that using $L_h$ and $L_g$ can significantly reduce the impact of data heterogeneity on the convergence error bound, especially when many local updates are used. The results reveal a region where FedAvg can outperform mini-batch SGD even when the gradient divergence is arbitrarily large, and they are validated by experiments on MNIST, CIFAR-10/100, and CINIC-10. Overall, the work bridges theory and practice by reframing data heterogeneity through $L_h$, enabling tighter bounds and practical guidance for local-update strategies in FL.
Abstract
In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence error bound grows dramatically as the number of local updates becomes large, especially when the product of the gradient divergence and the local Lipschitz constant is large. However, empirical studies show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new assumption that is weaker than the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this assumption and the gradient divergence assumption can jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to existing works, the local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant, and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective functions are quadratic, more insights into the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence is arbitrarily large. Our findings are validated using experiments.
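To make the quadratic case more concrete, the following is a minimal sketch, not the authors' code or experimental setup. It assumes full-batch (deterministic) gradients, hypothetical local objectives $F_i(x) = \frac{1}{2}(x - c_i)^\top A (x - c_i)$ that share one Hessian $A$, and the function names `fedavg_quadratic` and `gd_global`, all of which are illustrative choices. Under these assumptions, the average of the local gradients at the local models equals the global gradient at the averaged model, so local updates introduce no extra error even though the minimizers $c_i$ (and hence the gradient divergence) can be arbitrarily far apart.

```python
import numpy as np

def fedavg_quadratic(A_list, c_list, x0, lr, local_steps, rounds):
    """FedAvg with full participation on F_i(x) = 0.5 (x - c_i)^T A_i (x - c_i).

    Illustrative sketch: deterministic gradients, plain model averaging.
    """
    x = x0.copy()
    for _ in range(rounds):
        local_models = []
        for A, c in zip(A_list, c_list):
            xi = x.copy()
            for _ in range(local_steps):
                xi = xi - lr * (A @ (xi - c))  # local gradient step on F_i
            local_models.append(xi)
        x = np.mean(local_models, axis=0)      # server averages local models
    return x

def gd_global(A_list, c_list, x0, lr, steps):
    """Gradient descent on the global objective f = mean_i F_i."""
    x = x0.copy()
    for _ in range(steps):
        grad = np.mean([A @ (x - c) for A, c in zip(A_list, c_list)], axis=0)
        x = x - lr * grad
    return x

# Two clients with the same Hessian but minimizers that are very far apart.
A = np.diag([1.0, 2.0])
A_list = [A, A]
c_list = [np.array([1000.0, -1000.0]), np.array([-998.0, 1004.0])]
x0 = np.zeros(2)

x_fedavg = fedavg_quadratic(A_list, c_list, x0, lr=0.1, local_steps=5, rounds=10)
x_gd = gd_global(A_list, c_list, x0, lr=0.1, steps=50)
print(np.allclose(x_fedavg, x_gd))
```

In this toy setting, 10 rounds of 5 local steps follow the same trajectory (at the averaging points) as 50 steps of gradient descent on the global objective, while using one fifth of the communication rounds. When the local Hessians differ, the two trajectories separate, which is the kind of effect that a heterogeneity-driven constant is meant to capture; the comparison with mini-batch SGD in the paper involves stochastic gradients and is not reproduced here.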
