Why Federated Optimization Fails to Achieve Perfect Fitting? A Theoretical Perspective on Client-Side Optima

Zhongxiang Lei, Qi Yang, Ping Qiu, Gang Zhang, Yuanchi Ma, Jinyan Liu

TL;DR

This work addresses why federated optimization can fail to perfectly fit heterogeneous client data by introducing the assumption of heterogeneous local optima and deriving a lower bound on the global objective that grows with local-optima dispersion. It further characterizes an oscillatory convergence region near the end of training and analyzes three federated-method families (LA, DC, SA), providing an LA-FedAVG trajectory theorem, drift-correction conditions, and SA behavior under heterogeneity. The theoretical results are supported by experiments across diverse neural architectures (GRU, ResNet-18, ViT, DeepSeek) and datasets, and the authors provide an open-source FedTorch framework for replication. The findings offer practical guidance on choosing local update counts, participation rates, and correction strategies to mitigate underfitting in non-iid settings, with broad implications for federated learning practice.

Abstract

Federated optimization is a constrained form of distributed optimization that enables training a global model without directly sharing client data. Although existing algorithms can guarantee convergence in theory and often achieve stable training in practice, the reasons behind performance degradation under data heterogeneity remain unclear. To address this gap, the main contribution of this paper is to provide a theoretical perspective that explains why such degradation occurs. We introduce the assumption that heterogeneous client data lead to distinct local optima, and show that this assumption implies two key consequences: 1) the distance among clients' local optima raises the lower bound of the global objective, making perfect fitting of all client data impossible; and 2) in the final training stage, the global model oscillates within a region instead of converging to a single optimum, limiting its ability to fully fit the data. These results provide a principled explanation for performance degradation in non-iid settings, which we further validate through experiments across multiple tasks and neural network architectures. The framework used in this paper is open-sourced at: https://github.com/NPCLEI/fedtorch.
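
The oscillation described in consequence 2) can be reproduced in a toy setting. The sketch below is only an illustration under stated assumptions, not the paper's experimental setup: each client loss is the isotropic quadratic $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ with its own optimum $c_i$, clients run a few plain gradient steps, and the server averages the sampled clients' models. The names `fedavg_quadratic` and `tail_step_size` are hypothetical.

```python
import numpy as np

def fedavg_quadratic(centers, rounds, local_steps, lr, clients_per_round, seed=0):
    """Run a FedAvg-style loop on f_i(x) = 0.5 * ||x - c_i||^2 (toy assumption)."""
    rng = np.random.default_rng(seed)
    num_clients, dim = centers.shape
    x = np.zeros(dim)
    trajectory = [x.copy()]
    for _round in range(rounds):
        # Sample the participating clients without replacement.
        chosen = rng.choice(num_clients, size=clients_per_round, replace=False)
        local_models = []
        for i in chosen:
            x_local = x.copy()
            for _step in range(local_steps):
                # Gradient of 0.5 * ||x - c_i||^2 is (x - c_i).
                x_local -= lr * (x_local - centers[i])
            local_models.append(x_local)
        # Server step: average the local models.
        x = np.mean(local_models, axis=0)
        trajectory.append(x.copy())
    return np.array(trajectory)

def tail_step_size(trajectory, tail=50):
    """Mean distance moved per round over the last `tail` rounds."""
    steps = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
    return float(steps[-tail:].mean())
```

With full participation the global model in this toy problem settles at the mean of the client optima and the per-round movement vanishes. With partial participation each round pulls the model toward the mean of a different subset of optima, so the per-round movement stays bounded away from zero and the iterate keeps moving inside a region.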

Paper Structure

This paper contains 36 sections, 11 theorems, 44 equations, 8 figures, 3 tables, 2 algorithms.

Key Result

Theorem 3.1

(Lower Bound of Objective Function) If $f(x; D_i)$ can be approximated as a convex function within the neighborhood $U_i(x_i^*)$ around $x_i^*$, then $\nabla f_i(x_i^*) = 0$ and $\nabla^2 f_i(x_i^*)$ is positive definite. For all $x \in U_0 \cap U_1 \cap \ldots$, the lower bound of Eq. eq:target_fl is: ... where $\lambda_{\text{min}}^i > 0$ is the smallest eigenvalue of $\nabla^2 f_i(x_i^* + t (x - x_i^*))$, $t \in$ ...
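
As a rough numerical illustration of the idea behind this kind of bound (not the paper's exact statement), the sketch below assumes quadratic client losses $f_i(x) = \frac{1}{2}(x - x_i^*)^\top A_i (x - x_i^*)$ with positive definite $A_i$ and uniform client weights. All function names are hypothetical. It checks that the minimum of the averaged objective is zero only when the local optima coincide, and that the objective is bounded below by an eigenvalue-weighted average of squared distances to the local optima.

```python
import numpy as np

def make_clients(num_clients, dim, spread, seed=0):
    """Draw local optima (scaled by `spread`) and positive definite Hessians."""
    rng = np.random.default_rng(seed)
    centers = spread * rng.normal(size=(num_clients, dim))
    hessians = []
    for _ in range(num_clients):
        m = rng.normal(size=(dim, dim))
        hessians.append(m @ m.T + np.eye(dim))  # symmetric, eigenvalues >= 1
    return centers, np.stack(hessians)

def global_objective(x, centers, hessians):
    """Uniform average of f_i(x) = 0.5 * (x - c_i)^T A_i (x - c_i)."""
    diffs = x - centers
    return 0.5 * np.mean(np.einsum("nd,nde,ne->n", diffs, hessians, diffs))

def global_minimizer(centers, hessians):
    """Solve the stationarity condition sum_i A_i (x - c_i) = 0."""
    lhs = hessians.sum(axis=0)
    rhs = np.einsum("nde,ne->d", hessians, centers)
    return np.linalg.solve(lhs, rhs)

def eigen_lower_bound(x, centers, hessians):
    """Average of 0.5 * lambda_min^i * ||x - c_i||^2 over clients."""
    lam_min = np.linalg.eigvalsh(hessians)[:, 0]  # eigenvalues are ascending
    sq_dist = np.sum((x - centers) ** 2, axis=1)
    return 0.5 * np.mean(lam_min * sq_dist)
```

Because the local optima scale linearly with `spread` while the Hessians stay fixed for a given seed, the minimum of the averaged objective in this toy setting scales with `spread ** 2`: the farther apart the local optima, the higher the floor that no single global model can get below.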

Figures (8)

  • Figure 1: The performance of FedAVG, FedAVGM [cheng2024momentum], DeltaSGD [kim2023adaptive], FedRed [jiang2024federated], FedEXP [jhunjhunwalafedexp], FedGM [sun2024role], FedInit [sun2023understanding], FedLESAM [qu2022generalized], FedNAR [li2023fednar], FedPROX [li2020federated], SCAFFOLD [karimireddy2020scaffold], SCAFFNEW [mishchenko2022proxskip], and FedADAM [reddi2020adaptive] on the EMNIST classification task. These experiments demonstrate a common phenomenon: the algorithms converge to a stationary point, but their final performance deteriorates as heterogeneity increases. Here $\alpha$ is the Dirichlet distribution parameter, commonly used to simulate heterogeneity.
  • Figure 2: The left and right figures support the view that the clients' local optimal points are heterogeneous. The left figure merges the loss landscapes of all $f_i$ into a single plot using the aggregation method $g(x) = \min (f_1(x), f_2(x), \dots)$. The right figure shows how the relative positions of the (approximate) local optimal points change across tasks and rounds.
  • Figure 3: The lower bound of Eq. eq:target_fl ($F$) is pulled up by the distance between local optimal points.
  • Figure 4: The left four diagrams are contour plots of the paraboloid surface $f$, where the green points indicate the local optimal points of the $f_i$, sampled from multivariate Gaussian and Laplace distributions. The blue line shows the optimization trajectory of $x_t$, while the red dashed lines outline the oscillatory region. In the upper right corner, the left plot shows the communication rounds on the horizontal axis and the average value of $\cos \left\langle \delta_{t,i}, \delta_{t,j} \right\rangle$ on the vertical axis. The two contour plots in the upper right corner explain the effect of the positive definiteness of $A$ on the range of quadratic forms. The white areas indicate regions where the values are less than zero, the red vectors represent $X_{\mathcal{S}_t}$ before being deflected by $\mathcal{P}_{\mathcal{S}_t}$, and the blue vectors represent the deflected ones. The lower right section presents the trajectories for $P$ ranging from 0.01% to 0.09%.
  • Figure 5: The performance of FedAVG, FedAVGM, DeltaSGD, DualPROX, FedEXP, FedGM, FedInit, FedLESAM, FedNAR, FedPROX, SCAFFOLD, SCAFFNEW, and FedADAM on the EMNIST classification task.
  • ...and 3 more figures
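
The caption of Figure 1 mentions a Dirichlet parameter $\alpha$ for simulating heterogeneity. A common way to do this is label-skew partitioning, sketched below under the assumption that each class is split across clients with proportions drawn from a symmetric Dirichlet($\alpha$) distribution. This is a generic convention, not necessarily the scheme used in the paper or in FedTorch, and `dirichlet_partition` is a hypothetical name.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them by sampled proportions.
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(shard.tolist())
    return client_indices
```

Smaller values of `alpha` concentrate each class on fewer clients (stronger heterogeneity), while larger values make the per-client label distributions closer to uniform.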

Theorems & Definitions (18)

  • Theorem 3.1
  • proof
  • Theorem 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Corollary 4.4
  • proof
  • Theorem 4.5
  • Theorem D.1
  • proof
  • ...and 8 more