A New Theoretical Perspective on Data Heterogeneity in Federated Optimization

Jiayi Wang; Shiqiang Wang; Rong-Rong Chen; Mingyue Ji

TL;DR

The paper addresses the mismatch between theory and practice in federated optimization by introducing the heterogeneity-driven pseudo-Lipschitz constant $L_h$ and the global Lipschitz gradient constant $L_g$, which rest on weaker assumptions and are more informative than the local Lipschitz constant. It derives convergence bounds for FedAvg and its extensions under nonconvex objectives, showing that using $L_h$ and $L_g$ in place of the local Lipschitz constant can significantly reduce the effect of data heterogeneity on the convergence bound, especially when many local updates are used. For quadratic objectives, the results identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence is arbitrarily large. The findings are validated by experiments on MNIST, CIFAR-10/100, and CINIC-10. Overall, the work bridges theory and practice by reframing data heterogeneity through $L_h$, enabling tighter bounds and practical guidance for local-update strategies in FL.

Abstract

In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence upper bound grows dramatically when the number of local updates becomes large, especially when the product of the gradient divergence and the local Lipschitz constant is large. However, empirical studies show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new assumption, weaker than the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this assumption and the gradient divergence assumption can jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to existing works, the local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant, and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective function is quadratic, more insights into the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence is arbitrarily large. Our findings are validated using experiments.
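To make the role of "local updates" concrete, the following is a minimal sketch of one FedAvg round with full participation and deterministic gradients. It is an illustration under stated assumptions, not the authors' implementation: the names `fedavg_round` and `make_quadratic_grad` are hypothetical, the quadratic clients are toy objectives chosen for this example, and the split into a local learning rate and a global learning rate follows the common two-sided FedAvg formulation, which the abstract itself does not spell out.

```python
import numpy as np


def make_quadratic_grad(A, b):
    """Gradient of the toy client objective F_i(x) = 0.5 * (x - b)^T A (x - b)."""
    return lambda x: A @ (x - b)


def fedavg_round(x_global, grad_fns, local_lr, global_lr, num_local_steps):
    """Run one FedAvg round: local gradient steps per client, then server averaging."""
    deltas = []
    for grad in grad_fns:
        x = x_global.copy()
        for _ in range(num_local_steps):
            x = x - local_lr * grad(x)  # local update on this client's objective
        deltas.append(x - x_global)     # model change produced by this client
    # The server moves along the average client change, scaled by the global rate.
    return x_global + global_lr * np.mean(deltas, axis=0)


# Two hypothetical clients with different curvature and different minimizers.
clients = [
    make_quadratic_grad(np.diag([1.0, 3.0]), np.array([2.0, -1.0])),
    make_quadratic_grad(np.diag([2.0, 0.5]), np.array([-1.0, 4.0])),
]
x0 = np.array([1.0, -1.0])

# With a single local step and global_lr = 1, the round reduces to one
# gradient step on the averaged objective.
x_one_step = fedavg_round(x0, clients, local_lr=0.1, global_lr=1.0, num_local_steps=1)

# With several local steps, the client models drift apart before averaging.
x_many_steps = fedavg_round(x0, clients, local_lr=0.1, global_lr=1.0, num_local_steps=10)
```

With `num_local_steps=1` the round coincides with a centralized gradient step, so any effect of data heterogeneity on the averaged model appears only once clients take more than one local step.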

Paper Structure

This paper contains 34 sections, 23 theorems, 215 equations, 9 figures, 2 tables, 3 algorithms.

Introduction
Related Work
Preliminaries
Main Results
Discussions
Properties and Advantages of $L_h$
Analysis for Quadratic Objective Functions
Experiments
Conclusion
Additional Discussions
Additional Details of Related Work
Additional Results for Quadratic Objective Functions
Applying $L_h$ and $L_g$ in the Analysis for FedAvg with Momentum
Applying $L_h$ and $L_g$ in the Analysis for FedAdam
Applying $L_h$ and $L_g$ in the Analysis for Strongly Convex Objective Functions
...and 19 more sections

Key Result

Theorem 4.3

Suppose the bounded stochastic variance, bounded gradient divergence, global Lipschitz gradient, and gradient-to-model assumptions hold. When $\gamma\eta \le \frac{1}{2IL_g}$ and $\gamma\le \min\{ \frac{1}{2\sqrt{30}IL_g}, \frac{1}{\sqrt{6(L_h^2+L_g^2)}I}\}$, after [...], where $[R]:=\{0,1,\ldots,R-1\}$ in this paper.
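The two learning-rate conditions in Theorem 4.3 can be evaluated directly. The sketch below only restates those inequalities as code; the function names `max_gamma` and `max_eta` are hypothetical, and the numeric values of $I$, $L_g$, and $L_h$ are made up for illustration rather than taken from the paper.

```python
import math


def max_gamma(num_local_steps, L_g, L_h):
    """Largest gamma allowed by gamma <= min{1/(2*sqrt(30)*I*L_g), 1/(sqrt(6*(L_h^2+L_g^2))*I)}."""
    bound_from_L_g = 1.0 / (2.0 * math.sqrt(30.0) * num_local_steps * L_g)
    bound_from_L_h = 1.0 / (math.sqrt(6.0 * (L_h**2 + L_g**2)) * num_local_steps)
    return min(bound_from_L_g, bound_from_L_h)


def max_eta(gamma, num_local_steps, L_g):
    """Largest eta allowed by gamma * eta <= 1 / (2 * I * L_g) for a given gamma."""
    return 1.0 / (2.0 * num_local_steps * L_g * gamma)


# Illustrative values only: I = 10 local steps, L_g = 1.
gamma_small_L_h = max_gamma(num_local_steps=10, L_g=1.0, L_h=0.0)
gamma_large_L_h = max_gamma(num_local_steps=10, L_g=1.0, L_h=5.0)
eta_limit = max_eta(gamma=0.005, num_local_steps=10, L_g=1.0)
```

Comparing the two terms inside the minimum, $\sqrt{6(L_h^2+L_g^2)} > 2\sqrt{30}L_g$ exactly when $L_h > \sqrt{19}\,L_g$, so the $L_h$-dependent term is the binding one only in that case; otherwise the cap on $\gamma$ is determined by $I$ and $L_g$ alone.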

Figures (9)

Figure 1: An illustrative comparison between local updates and centralized updates. $\bar{\mathbf{x}}^r$ is the global model at the $r$th round. The local models after $k$ local iterations at the $r$th round are denoted by $\mathbf{x}_1^{r,k}$ and $\mathbf{x}_2^{r,k}$. The average of $\mathbf{x}_1^{r,k}$ and $\mathbf{x}_2^{r,k}$ is $\hat{\mathbf{x}}^{r,k}$. The centralized model after $k$ centralized iterations is denoted by $\mathbf{x}_c^{r,k}$. The gradient divergence $\zeta$ reflects the difference between $\mathbf{x}_c^{r,k}$ and each local model $\mathbf{x}_i^{r,k}$, $i=1,2$, whereas $L_h$ reflects the difference between $\mathbf{x}_c^{r,k}$ and the averaged model $\hat{\mathbf{x}}^{r,k}$.
Figure 2: Results for CNN with CIFAR-10 and MLP with MNIST. For CNN, the learning rates are chosen as $\eta=2$ and $\gamma=0.05$. For MLP, the learning rates are chosen as $\eta=2$ and $\gamma=0.1$. Results for CNN are shown in (a) and (b). Results for $75\%$ of MNIST are shown in (c) and (d).
Figure 3.1: Test accuracy for CNN with CIFAR-10 and MLP with MNIST, corresponding to the setting in Figure 2.
Figure 3.2: Results with CIFAR-10 dataset. The model is VGG-11. The percentage of heterogeneous data is $50\%$. The learning rates are chosen as $\eta=2$ and $\gamma= 0.01$.
Figure 3.3: Results with CIFAR-10 dataset. The model is VGG-11. The percentage of heterogeneous data is $75\%$. The learning rates are chosen as $\eta=2$ and $\gamma= 0.01$.
...and 4 more figures
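The quantities named in the Figure 1 caption (local models, their average $\hat{\mathbf{x}}^{r,k}$, and the centralized model $\mathbf{x}_c^{r,k}$) can be reproduced in a toy setting. The sketch below is an assumption-laden illustration, not the paper's experiment: it uses two hypothetical quadratic clients $F_i(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\mathbf{b}_i)^\top A_i(\mathbf{x}-\mathbf{b}_i)$, deterministic gradient steps, and a made-up function name `run_local_and_centralized`.

```python
import numpy as np


def run_local_and_centralized(hessians, optima, x_start, lr, num_steps):
    """Return (local models, their average, centralized model) after num_steps GD steps.

    Client i runs gradient descent on F_i(x) = 0.5 * (x - b_i)^T A_i (x - b_i);
    the centralized model runs gradient descent on the average of all F_i.
    """
    local = [x_start.copy() for _ in hessians]
    central = x_start.copy()
    for _ in range(num_steps):
        local = [x - lr * (A @ (x - b)) for x, A, b in zip(local, hessians, optima)]
        avg_grad = np.mean([A @ (central - b) for A, b in zip(hessians, optima)], axis=0)
        central = central - lr * avg_grad
    return local, np.mean(local, axis=0), central


x_start = np.zeros(2)
# Minimizers far apart, so the local gradients disagree strongly.
optima = [np.array([100.0, 0.0]), np.array([-100.0, 0.0])]

# Case 1: identical Hessians. The local models drift apart, yet their
# average follows the centralized trajectory.
A = np.diag([1.0, 2.0])
local_same, avg_same, central_same = run_local_and_centralized(
    [A, A], optima, x_start, lr=0.1, num_steps=5)
spread_same = np.linalg.norm(local_same[0] - local_same[1])
gap_same = np.linalg.norm(avg_same - central_same)

# Case 2: different Hessians. The averaged model now departs from the
# centralized one.
hessians_diff = [np.diag([1.0, 2.0]), np.diag([3.0, 0.5])]
_, avg_diff, central_diff = run_local_and_centralized(
    hessians_diff, optima, x_start, lr=0.1, num_steps=5)
gap_diff = np.linalg.norm(avg_diff - central_diff)
```

In the identical-Hessian case the averaged and centralized updates obey the same recursion, so `gap_same` is zero up to floating-point error even though `spread_same` is large. This is consistent with the caption's distinction between the per-client difference attributed to $\zeta$ and the averaged-model difference attributed to $L_h$, although the sketch does not compute either constant.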

Theorems & Definitions (24)

Theorem 4.3: General Non-convex Objective Functions
Corollary 4.4
Theorem 4.5: Partial Participation
Corollary 4.6
Proposition 5.1
Proposition 5.2
Proposition 5.3
Proposition 5.4
Theorem 5.5: Special Case of $L_h=0$
Corollary 5.6
...and 14 more
