A New Theoretical Perspective on Data Heterogeneity in Federated Optimization

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

TL;DR

The paper addresses the mismatch between theory and practice in federated optimization by introducing the heterogeneity-driven pseudo-Lipschitz constant $L_h$ and the global Lipschitz gradient constant $L_g$, which rest on weaker assumptions and are more informative than the local Lipschitz constant. It derives convergence bounds for FedAvg and its extensions under nonconvex objectives, showing that stating the bounds in terms of $L_h$ and $L_g$ can significantly reduce the impact of data heterogeneity on the convergence error, especially when many local updates are used. The results reveal a region where FedAvg can outperform mini-batch SGD even with arbitrarily large gradient divergence, and they are validated by experiments on MNIST, CIFAR-10/100, and CINIC-10. Overall, the work bridges theory and practice by reframing data heterogeneity through $L_h$, enabling tighter bounds and practical guidance for local-update strategies in FL.

Abstract

In federated learning (FL), data heterogeneity is the main reason that existing theoretical analyses are pessimistic about the convergence rate. In particular, for many FL algorithms, the convergence upper bound grows dramatically when the number of local updates becomes large, especially when the product of the gradient divergence and the local Lipschitz constant is large. However, empirical studies show that more local updates can improve the convergence rate even when these two parameters are large, which is inconsistent with the theoretical findings. This paper aims to bridge this gap between theoretical understanding and practical performance by providing a theoretical analysis from a new perspective on data heterogeneity. In particular, we propose a new assumption, weaker than the local Lipschitz gradient assumption, named the heterogeneity-driven pseudo-Lipschitz assumption. We show that this assumption and the gradient divergence assumption jointly characterize the effect of data heterogeneity. By deriving a convergence upper bound for FedAvg and its extensions, we show that, compared to existing works, the local Lipschitz constant is replaced by the much smaller heterogeneity-driven pseudo-Lipschitz constant, and the corresponding convergence upper bound can be significantly reduced for the same number of local updates, although its order stays the same. In addition, when the local objective functions are quadratic, more insights on the impact of data heterogeneity can be obtained using the heterogeneity-driven pseudo-Lipschitz constant. For example, we can identify a region where FedAvg can outperform mini-batch SGD even when the gradient divergence is arbitrarily large. Our findings are validated by experiments.
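The setting described in the abstract can be made concrete with a small sketch. It is not the authors' implementation: it assumes full participation, exact (noise-free) gradients, quadratic client objectives $F_i(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\mathbf{c}_i)^\top A_i (\mathbf{x}-\mathbf{c}_i)$, and a two-sided update in which $\gamma$ is the local step size and $\eta$ scales the averaged local change at the server. The function names, the matrices, and the centers `c_list` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fedavg_quadratic(A_list, c_list, x0, rounds, local_steps, gamma, eta):
    """FedAvg sketch on quadratic clients (assumed form, not from the paper).

    Client i minimizes F_i(x) = 0.5 * (x - c_i)^T A_i (x - c_i).
    gamma: local step size; eta: global (server) step size.
    """
    x_bar = np.asarray(x0, dtype=float).copy()
    for _ in range(rounds):
        deltas = []
        for A, c in zip(A_list, c_list):
            x = x_bar.copy()
            for _ in range(local_steps):
                x -= gamma * (A @ (x - c))      # exact local gradient step
            deltas.append(x - x_bar)            # change produced by local training
        x_bar = x_bar + eta * np.mean(deltas, axis=0)  # server aggregation
    return x_bar

def global_grad_norm(A_list, c_list, x):
    """Norm of the gradient of the global objective f = mean_i F_i at x."""
    grads = [A @ (x - c) for A, c in zip(A_list, c_list)]
    return float(np.linalg.norm(np.mean(grads, axis=0)))

def max_gradient_divergence(A_list, c_list, x):
    """max_i ||grad F_i(x) - grad f(x)||, an empirical proxy for zeta at x."""
    grads = [A @ (x - c) for A, c in zip(A_list, c_list)]
    mean_grad = np.mean(grads, axis=0)
    return float(max(np.linalg.norm(g - mean_grad) for g in grads))

# Hypothetical example: two clients share one Hessian but have distant minimizers.
A = np.diag([1.0, 2.0])
A_list = [A, A]
c_list = [np.array([100.0, -100.0]), np.array([-100.0, 100.0])]

x_final = fedavg_quadratic(A_list, c_list, x0=[5.0, 5.0], rounds=300,
                           local_steps=10, gamma=0.004, eta=2.0)
final_grad_norm = global_grad_norm(A_list, c_list, x_final)
divergence = max_gradient_divergence(A_list, c_list, x_final)
```

In this toy instance the client gradients disagree strongly at the solution (`divergence` is large), yet the global gradient norm still decays with ten local steps per round. The example only illustrates that gradient divergence alone does not determine how harmful local updates are; it does not reproduce any bound from the paper.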

Paper Structure

This paper contains 34 sections, 23 theorems, 215 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem 4.3

Suppose the bounded stochastic variance, bounded gradient divergence, global Lipschitz gradient, and gradient-to-model assumptions hold. When $\gamma\eta \le \frac{1}{2IL_g}$ and $\gamma\le \min\left\{ \frac{1}{2\sqrt{30}IL_g}, \frac{1}{\sqrt{6(L_h^2+L_g^2)}I}\right\}$, after … (the remainder of the statement is cut off in this excerpt). Throughout the paper, $[R]:=\{0,1,\ldots,R-1\}$.
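The step-size conditions quoted in Theorem 4.3 are simple to check numerically. The sketch below evaluates them as written, reading $I$ as the number of local updates per round, $\gamma$ as the local step size, and $\eta$ as the global step size; these readings and the helper names are assumptions, since the excerpt does not define the symbols.

```python
import math

def max_local_step_size(I, L_g, L_h):
    """Largest gamma allowed by the min{...} condition quoted in Theorem 4.3."""
    bound_a = 1.0 / (2.0 * math.sqrt(30.0) * I * L_g)
    bound_b = 1.0 / (math.sqrt(6.0 * (L_h ** 2 + L_g ** 2)) * I)
    return min(bound_a, bound_b)

def satisfies_step_conditions(gamma, eta, I, L_g, L_h):
    """Check both conditions: gamma*eta <= 1/(2 I L_g) and the bound on gamma."""
    product_ok = gamma * eta <= 1.0 / (2.0 * I * L_g)
    gamma_ok = gamma <= max_local_step_size(I, L_g, L_h)
    return product_ok and gamma_ok

# Hypothetical values, chosen only to exercise the helpers.
gamma_max = max_local_step_size(I=10, L_g=2.0, L_h=0.0)
ok_small = satisfies_step_conditions(gamma=0.004, eta=2.0, I=10, L_g=2.0, L_h=0.0)
ok_large = satisfies_step_conditions(gamma=0.01, eta=2.0, I=10, L_g=2.0, L_h=0.0)
```

Comparing the two terms inside the minimum shows that the first one is the binding constraint whenever $L_h \le \sqrt{19}\,L_g$, because $2\sqrt{30}\,L_g \ge \sqrt{6(L_h^2+L_g^2)}$ is equivalent to $L_h^2 \le 19L_g^2$. The second term matters only when $L_h$ exceeds that threshold.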

Figures (9)

  • Figure 1: An illustrative comparison between local updates and centralized updates. $\bar{\mathbf{x}}^r$ is the global model at the $r$th round. The local models after $k$ local iterations in the $r$th round are denoted by $\mathbf{x}_1^{r,k}$ and $\mathbf{x}_2^{r,k}$, and their average is $\hat{\mathbf{x}}^{r,k}$. The centralized model after $k$ centralized iterations is denoted by $\mathbf{x}_c^{r,k}$. Here $\zeta$ reflects the difference between $\mathbf{x}_c^{r,k}$ and each $\mathbf{x}_i^{r,k}$, $i=1,2$, while $L_h$ reflects the difference between $\mathbf{x}_c^{r,k}$ and $\hat{\mathbf{x}}^{r,k}$.
  • Figure 2: Results for CNN with CIFAR-10 and MLP with MNIST. For CNN, the learning rates are chosen as $\eta=2$ and $\gamma=0.05$. For MLP, the learning rates are chosen as $\eta=2$ and $\gamma=0.1$. Results for CNN are shown in (a) and (b). Results for $75\%$ of MNIST are shown in (c) and (d).
  • Figure 3.1: Test accuracy for CNN with CIFAR-10 and MLP with MNIST, corresponding to the setting in Figure 2.
  • Figure 3.2: Results with CIFAR-10 dataset. The model is VGG-11. The percentage of heterogeneous data is $50\%$. The learning rates are chosen as $\eta=2$ and $\gamma= 0.01$.
  • Figure 3.3: Results with CIFAR-10 dataset. The model is VGG-11. The percentage of heterogeneous data is $75\%$. The learning rates are chosen as $\eta=2$ and $\gamma= 0.01$.
  • ...and 4 more figures
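The distinction drawn in the Figure 1 caption, between how far each local model drifts from the centralized model and how far their average drifts, can be reproduced on a toy problem. The sketch below is an illustration under assumptions that are not taken from the paper: two clients with quadratic objectives $F_i(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\mathbf{c}_i)^\top A_i (\mathbf{x}-\mathbf{c}_i)$, exact gradients, and the same step size for local and centralized iterations. All names and numbers are hypothetical.

```python
import numpy as np

def local_vs_centralized(A_list, c_list, x_start, k, gamma):
    """Compare k local steps per client with k centralized steps on the average objective.

    Returns (gap_avg, gap_local):
      gap_avg   = ||x_hat - x_c||, average of local models vs. centralized model
      gap_local = [||x_i - x_c||], each local model vs. centralized model
    """
    x_start = np.asarray(x_start, dtype=float)

    # k local gradient steps on each client's own objective.
    local_models = []
    for A, c in zip(A_list, c_list):
        x = x_start.copy()
        for _ in range(k):
            x -= gamma * (A @ (x - c))
        local_models.append(x)
    x_hat = np.mean(local_models, axis=0)

    # k centralized gradient steps on f = mean_i F_i.
    x_c = x_start.copy()
    for _ in range(k):
        grad = np.mean([A @ (x_c - c) for A, c in zip(A_list, c_list)], axis=0)
        x_c -= gamma * grad

    gap_avg = float(np.linalg.norm(x_hat - x_c))
    gap_local = [float(np.linalg.norm(x - x_c)) for x in local_models]
    return gap_avg, gap_local

# Hypothetical clients with distant minimizers.
c_list = [np.array([100.0, -100.0]), np.array([-100.0, 100.0])]
x_start = [5.0, 5.0]

# Case 1: both clients share the same Hessian.
shared = [np.diag([1.0, 2.0]), np.diag([1.0, 2.0])]
gap_avg_shared, gap_local_shared = local_vs_centralized(shared, c_list, x_start, k=10, gamma=0.01)

# Case 2: the clients have different Hessians.
mixed = [np.diag([1.0, 2.0]), np.diag([3.0, 0.5])]
gap_avg_mixed, gap_local_mixed = local_vs_centralized(mixed, c_list, x_start, k=10, gamma=0.01)
```

With a shared Hessian the local dynamics are linear with the same matrix, so the averaged model matches the centralized one up to floating-point error even though each local model is far from it. With different Hessians the averaged model also departs from the centralized one. This mirrors the caption's separation of the two effects, without claiming to compute the paper's $\zeta$ or $L_h$.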

Theorems & Definitions (24)

  • Theorem 4.3: General Non-convex Objective Functions
  • Corollary 4.4
  • Theorem 4.5: Partial Participation
  • Corollary 4.6
  • Proposition 5.1
  • Proposition 5.2
  • Proposition 5.3
  • Proposition 5.4
  • Theorem 5.5: Special Case of $L_h=0$
  • Corollary 5.6
  • ...and 14 more