An Operator Splitting View of Federated Learning

Saber Malekmohammadi, Kiarash Shaloudegi, Zeou Hu, Yaoliang Yu

TL;DR

This paper recasts federated learning as an operator-splitting problem, unifying core algorithms under a single framework and clarifying how step-size and local updates influence convergence. It shows FedAvg corresponds to forward-backward splitting, FedProx to backward-backward splitting, FedSplit to Peaceman-Rachford, FedPi to Douglas-Rachford, and FedRP to Reflection-Projection, revealing new algorithmic variants and deeper connections. The authors also introduce a practical acceleration path via Anderson acceleration that speeds up convergence without adding communication overhead, and provide extensive convex and nonconvex experiments to validate the theory. The work offers a standardized, extensible view of FL algorithms, enabling streamlined implementation, comparison, and scalable acceleration across heterogeneous devices and networks.
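To make the splitting correspondence concrete, here is a minimal sketch of two of the schemes named above, FedProx (backward-backward) and FedSplit (Peaceman-Rachford), on a toy problem. Everything specific is an assumption for illustration and not taken from the paper: the scalar quadratic losses `f_i(w) = 0.5 * a[i] * (w - b[i])**2`, the data `a` and `b`, and the function names `prox`, `fedprox`, and `fedsplit`. Quadratics are used only because their proximal map has a closed form.

```python
import numpy as np

# Hypothetical toy data: user i holds f_i(w) = 0.5 * a[i] * (w - b[i])**2.
a = np.array([1.0, 2.0, 4.0])
b = np.array([-1.0, 0.5, 3.0])
w_star = np.sum(a * b) / np.sum(a)  # minimizer of sum_i f_i(w)


def prox(z, eta):
    """Closed-form proximal map of each f_i with step size eta (elementwise)."""
    return (z + eta * a * b) / (1.0 + eta * a)


def fedprox(eta, rounds=2000):
    """Backward-backward: every user takes a prox step from the shared w,
    then the server averages (projection onto the consensus set)."""
    w = 0.0
    for _ in range(rounds):
        w = float(np.mean(prox(w, eta)))
    return w


def fedsplit(eta, rounds=2000):
    """Peaceman-Rachford: reflect through the losses, then through consensus."""
    z = np.zeros_like(a)  # one local variable per user
    for _ in range(rounds):
        u = 2.0 * prox(z, eta) - z   # reflection w.r.t. the local losses
        z = 2.0 * np.mean(u) - u     # reflection w.r.t. the consensus set
    return float(np.mean(z))


if __name__ == "__main__":
    print("w*               :", w_star)
    print("FedProx, eta=1.0 :", fedprox(1.0))
    print("FedProx, eta=0.01:", fedprox(0.01))
    print("FedSplit, eta=1.0:", fedsplit(1.0))
```

In this toy setup, FedProx with a constant step size settles at a point that depends on `eta` and only approaches `w_star` as `eta` shrinks, whereas the FedSplit iteration reaches `w_star` at the fixed `eta` used here. This is one small illustration of why the step size matters, not a reproduction of the paper's experiments.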

Abstract

Over the past few years, the federated learning ($\texttt{FL}$) community has witnessed a proliferation of new $\texttt{FL}$ algorithms. However, our understanding of the theory of $\texttt{FL}$ is still fragmented, and a thorough, formal comparison of these algorithms remains elusive. Motivated by this gap, we show that many of the existing $\texttt{FL}$ algorithms can be understood from an operator splitting point of view. This unification allows us to compare different algorithms with ease, to refine previous convergence results and to uncover new algorithmic variants. In particular, our analysis reveals the vital role played by the step size in $\texttt{FL}$ algorithms. The unification also leads to a streamlined and economic way to accelerate $\texttt{FL}$ algorithms, without incurring any communication overhead. We perform numerical experiments on both convex and nonconvex models to validate our findings.
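The acceleration referred to here is identified in the TL;DR as Anderson acceleration, which extrapolates from past iterates of a fixed-point map. The sketch below shows a generic, textbook-style variant applied to an arbitrary map `g`; it is not the authors' implementation. The names `anderson` and `plain`, the test map `A @ x + c`, and the reading of `tau` as the history length are all assumptions made for illustration. Because the extrapolation uses only iterates that have already been computed, it needs no extra evaluations of `g` per step.

```python
import numpy as np


def anderson(g, x0, tau=2, max_iter=100, tol=1e-10):
    """Anderson acceleration of x <- g(x) using at most tau past differences.
    Returns the final iterate and the number of evaluations of g."""
    x = np.asarray(x0, dtype=float)
    G, R = [], []  # histories of g(x_k) and residuals g(x_k) - x_k
    for k in range(max_iter):
        gx = g(x)
        r = gx - x
        if np.linalg.norm(r) < tol:
            return x, k + 1
        G.append(gx)
        R.append(r)
        G, R = G[-(tau + 1):], R[-(tau + 1):]
        if len(R) == 1:
            x = gx  # no history yet: plain fixed-point step
            continue
        # Columns are successive differences of residuals and of map values.
        dR = np.stack([R[j + 1] - R[j] for j in range(len(R) - 1)], axis=1)
        dG = np.stack([G[j + 1] - G[j] for j in range(len(G) - 1)], axis=1)
        # Least-squares mixing weights that minimise the extrapolated residual.
        gamma, *_ = np.linalg.lstsq(dR, r, rcond=None)
        x = gx - dG @ gamma
    return x, max_iter


def plain(g, x0, max_iter=10000, tol=1e-10):
    """Unaccelerated fixed-point iteration, for comparison."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        gx = g(x)
        if np.linalg.norm(gx - x) < tol:
            return x, k + 1
        x = gx
    return x, max_iter


# Hypothetical linear contraction standing in for one communication round.
A = np.array([[0.9, 0.05], [0.05, 0.85]])
c = np.array([1.0, 2.0])
g = lambda x: A @ x + c

x_acc, n_acc = anderson(g, np.zeros(2), tau=2)
x_plain, n_plain = plain(g, np.zeros(2))
```

On this small linear example the accelerated run needs far fewer evaluations of `g` than the plain iteration to reach the same tolerance. How much is gained on real models will depend on the problem and on `tau`.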

Paper Structure

This paper contains 29 sections, 12 theorems, 51 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Assuming each user participates indefinitely, the step size $\eta_t$ is bounded from below (i.e., $\liminf_t \eta_t > 0$), the user functions $\{f_i\}$ are convex, and homogeneous in the sense that they have a common minimizer, i.e., $\mathsf{F} := \bigcap_i \mathop{\mathrm{argmin}}\limits_{\mathbf{w}_i

Figures (11)

  • Figure 1: When $f = \iota_C$ is the indicator function of a (closed) set $C$, $\mathsf{P}_{f}^{\eta}(\mathbf{w})$ is the Euclidean projection of $\mathbf{w}$ onto the set $C$. Similarly, $\mathsf{R}_{f}^{\eta}(\mathbf{v})$ is the Euclidean reflection of $\mathbf{v}$ w.r.t. the set $C$.
  • Figure 2: Optimality gap $\{f(\mathbf{w}_{\texttt{FedAvg}}^{*})-f^{*}\}$ or training loss $\{f(\mathbf{w}_{\texttt{FedAvg}}^{*})\}$ of (approximate) fixed-point solutions of FedAvg for different learning rates $\eta$ and local epochs $k$. Different colored lines are for different numbers of local epochs, and dashed lines for different product values $\eta(k-1)$. Left: least squares (closed-form solution); Middle: logistic regression ($6000$ communication rounds); Right: nonconvex CNN on the MNIST dataset ($200$ communication rounds).
  • Figure 3: Effect of step size $\eta$ and averaging on FedProx. Left: least squares; Middle: logistic regression; Right: CNN on MNIST. The dashed and solid lines with the same color show the results obtained with and without the ergodic averaging step in the FedProx theorem (\ref{thm:fp}), respectively. For exponentially decaying $\eta_t$, we use a period $T$ of $500$ for both the least squares and logistic regression experiments, and $10$ for the CNN experiment.
  • Figure 4: The effect of data heterogeneity on the performance of different splitting methods. The top row shows the results for least squares, and the bottom row shows the results for the nonconvex CNN model. Top-Left: small data heterogeneity with $H\approx 119\times10^3$; Top-Middle: moderate data heterogeneity with $H\approx7.61\times10^{6}$; Top-Right: large data heterogeneity with $H\approx190.3\times10^{6}$. Bottom-Left: i.i.d. data distribution; Bottom-Middle: non-i.i.d. data distribution with at most $6$ classes per user; Bottom-Right: non-i.i.d. data distribution with at most $2$ classes per user.
  • Figure 5: Effect of Anderson acceleration. Left: least squares with $\tau=2$; Middle: logistic regression with $\tau=2$; Right: nonconvex CNN with $\tau=10$. Dashed lines are the accelerated results.
  • ...and 6 more figures
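
The projection and reflection described in the Figure 1 caption can be sketched for one simple closed convex set. The box $C = [-1, 1]^3$ and the helper names `project_box` and `reflect_box` are assumptions chosen for illustration. For an indicator function the step size $\eta$ plays no role, so it does not appear in the code.

```python
import numpy as np


def project_box(w, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d, i.e. the proximal map
    of the indicator function of that box."""
    return np.clip(w, lo, hi)


def reflect_box(v, lo, hi):
    """Reflection w.r.t. the box: R(v) = 2 * P(v) - v."""
    return 2.0 * project_box(v, lo, hi) - v


v = np.array([2.0, 0.5, -3.0])
p = project_box(v, -1.0, 1.0)  # coordinates outside the box are clipped
r = reflect_box(v, -1.0, 1.0)  # the projection p is the midpoint of v and r
```

Points already inside the set are left unchanged by both maps, which is why the second coordinate of `v` is the same in `p` and `r`.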

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2: Lions78, Passty79
  • Theorem 3: Passty79
  • Theorem 4: YuZMX15
  • Theorem 5
  • Theorem 6: Spingarn83, LionsMercier79
  • Theorem 7
  • Lemma 1: BrezisBrowder76, Bauschke2011
  • Theorem 7
  • Theorem 7: Lions78, Passty79
  • ...and 3 more