On Principled Local Optimization Methods for Federated Learning

Honglin Yuan

TL;DR

The paper develops a principled theory for local optimization in Federated Learning, beginning with sharp lower/upper bounds for FedAvg and introducing iterate bias as a fundamental obstacle. It then presents FedAc, a provably accelerated version of FedAvg that balances acceleration with stability to improve convergence and reduce communication, and extends to non-smooth composite objectives via FedMiD and FedDualAvg, which address the curse of primal averaging through dual-space server averaging. Across convex and non-convex settings, including third-order smoothness, the work provides strong convergence guarantees and practical insights supported by numerical experiments. The results offer a cohesive framework linking stochastic differential equation perspectives, stability analyses, and primal-dual techniques, with clear implications for more efficient, scalable, and privacy-preserving on-device FL.

Abstract

Federated Learning (FL), a distributed learning paradigm that scales on-device learning collaboratively, has emerged as a promising approach for decentralized AI applications. Local optimization methods such as Federated Averaging (FedAvg) are the most prominent methods for FL applications. Despite their simplicity and popularity, the theoretical understanding of local optimization methods is far from clear. This dissertation aims to advance the theoretical foundation of local methods in the following three directions. First, we establish sharp bounds for FedAvg, the most popular algorithm in Federated Learning. We demonstrate how FedAvg may suffer from a notion we call iterate bias, and how an additional third-order smoothness assumption may mitigate this effect and lead to better convergence rates. We explain this phenomenon from a Stochastic Differential Equation (SDE) perspective. Second, we propose Federated Accelerated Stochastic Gradient Descent (FedAc), the first principled acceleration of FedAvg, which provably improves the convergence rate and communication efficiency. Our technique relies on a potential-based perturbed iterate analysis, a novel stability analysis of generalized accelerated SGD, and a strategic tradeoff between acceleration and stability. Third, we study the Federated Composite Optimization problem, which extends the classic smooth setting by incorporating a shared non-smooth regularizer. We show that direct extensions of FedAvg may suffer from the "curse of primal averaging," resulting in slow convergence. As a solution, we propose a new primal-dual algorithm, Federated Dual Averaging, which overcomes the curse of primal averaging by employing a novel inter-client dual averaging procedure.
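The "curse of primal averaging" can be illustrated with a toy computation. The sketch below is not the FedDualAvg algorithm itself; it only contrasts two server-side aggregation orders under assumptions chosen for illustration: an $\ell_1$ regularizer handled by soft-thresholding, three clients, and hand-picked dual vectors (the names `soft_threshold`, `client_duals`, and `lam` are hypothetical).

```python
def soft_threshold(v, lam):
    """Proximal map of lam * |v|: shrink v toward zero by lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def nnz(vec):
    """Number of nonzero coordinates."""
    return sum(1 for v in vec if v != 0.0)

def mean_vectors(vectors):
    """Coordinate-wise average of equally sized vectors."""
    m = len(vectors)
    return [sum(vec[j] for vec in vectors) / m for j in range(len(vectors[0]))]

lam = 1.0  # assumed regularization threshold
# Hypothetical client dual states: one shared strong coordinate,
# plus one client-specific coordinate that barely exceeds the threshold.
client_duals = [
    [2.0, 1.3, 0.1, 0.0],
    [2.2, 0.0, 1.2, -0.1],
    [1.9, -0.1, 0.0, 1.4],
]

# Each client maps its dual state to a sparse primal solution.
client_primals = [[soft_threshold(v, lam) for v in z] for z in client_duals]

# Primal averaging: average the sparse client solutions directly.
primal_avg = mean_vectors(client_primals)

# Dual averaging: average the dual states first, then map to the primal space.
dual_avg = [soft_threshold(v, lam) for v in mean_vectors(client_duals)]

print("nonzeros per client:", [nnz(p) for p in client_primals])
print("nonzeros after primal averaging:", nnz(primal_avg))
print("nonzeros after dual averaging:", nnz(dual_avg))
```

In this toy instance, averaging the client solutions takes the union of their supports, while thresholding after averaging keeps only the coordinate on which the clients agree.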

Paper Structure (187 sections, 98 theorems, 564 equations, 24 figures, 2 tables, 8 algorithms)

Introduction
Sharp Bounds for Federated Averaging and Continuous Perspective
Principled Acceleration of Federated Averaging
Federated Composite Optimization
Additional Related Work
Notations
Sharp Bounds for Federated Averaging and Continuous Perspective
Preliminaries
Interpretation of Proposition 2.1
Review of FedAvg Upper Bound Analysis
Iterate Bias of SGD
Proof Sketch of \ref{thm:fedavg:2o:bias:lb}
Lower Bound of FedAvg
Proof of \ref{thm:fedavg:lb:homo} and \ref{thm:fedavg:lb:hetero}
The Benefit of Third-Order Smoothness
...and 172 more sections

Key Result

proposition 2.1

Consider the model problem \ref{eq:fo:hetero} and assume \ref{asm:fo:2o} and \ref{asm:fedavg:hetero}. Consider running FedAvg with $M$ clients, $R$ rounds, and $K$ steps per round, starting from $\mathbf{x}^{(0,0)}$. Then there exists a step size $\eta$ such that FedAvg yields the bound \ref{eq:thm:fedavg:2o:ub}. In particular, when \ref{asm:fedavg:homo} holds, the RHS of \ref{eq:thm:fedavg:2o:ub} becomes $\mathcal{O}(\text{①}+\text{②}+\text{③}$ …
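The setup in the proposition ($M$ clients, $R$ rounds, $K$ local steps per round, step size $\eta$) can be sketched in a few lines. This is a minimal illustration and not the dissertation's implementation: the client objectives are hypothetical quadratics $\frac{1}{2}\|x - c_m\|^2$ with Gaussian gradient noise, and the names `fedavg` and `make_quadratic_client` are assumptions introduced here.

```python
import random

def fedavg(stoch_grads, x0, eta, R, K):
    """FedAvg sketch: M = len(stoch_grads) clients, R rounds, K local SGD steps."""
    M = len(stoch_grads)
    x_server = list(x0)
    for _ in range(R):
        client_iterates = []
        for grad in stoch_grads:
            x = list(x_server)  # every client starts the round at the server iterate
            for _ in range(K):
                g = grad(x)  # one stochastic gradient at the local iterate
                x = [xi - eta * gi for xi, gi in zip(x, g)]
            client_iterates.append(x)
        # The server averages the client iterates coordinate-wise.
        x_server = [sum(c[j] for c in client_iterates) / M
                    for j in range(len(x_server))]
    return x_server

def make_quadratic_client(center, noise_std, rng):
    """Hypothetical client objective 0.5 * ||x - center||^2 with noisy gradients."""
    def grad(x):
        return [xi - ci + rng.gauss(0.0, noise_std) for xi, ci in zip(x, center)]
    return grad

rng = random.Random(0)
centers = [[1.0, -1.0], [3.0, 1.0], [-1.0, 3.0]]  # average center is [1.0, 1.0]
clients = [make_quadratic_client(c, 0.1, rng) for c in centers]
x_out = fedavg(clients, x0=[0.0, 0.0], eta=0.1, R=50, K=5)
print(x_out)
```

Because these toy clients share the same Hessian, the averaged iterate settles near the average of the client centers; objectives whose curvature varies are where the iterate bias discussed in the paper enters.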

Figures (24)

Figure 1: Illustration of "curse of primal averaging". While each client of FedMiD can locate a sparse solution, simply averaging them will yield a much denser solution on the server side.
Figure 2: Illustration of the iterate bias of SGD. Consider the objective $F(x) = \begin{cases} x^2 & x \geq 0 \\ \frac{1}{10} x^2 & x < 0 \end{cases}$ as shown in (a), and $f(x; \xi) := \xi x + F(x)$ where $\xi \sim \mathcal{N}(0, 0.01)$. We initialize SGD at the optimum $x^{\star}=0$ and run 1024 steps of SGD with step size $10^{-2}$. We repeat this random process 65536 times and estimate the density function after 128, 256, 512, and 1024 steps. Observe that the density function and the average gradually move to the left (away from the optimum, toward the side where the curvature is smaller). This figure explains the intrinsic difficulty for FedAvg in handling objectives with drastic Hessian changes.
Figure 3: Observed linear speedup with respect to the number of clients $M$ under various synchronization intervals $K$. Our FedAc is tested against three baselines FedAvg, Mb-Sgd, and Mb-Ac-Sgd. While all four algorithms attain linear speedup for the fully synchronized ($K=1$) setting, FedAvg and Mb-Sgd lose linear speedup for $K$ as low as 8. Mb-Ac-Sgd is comparably better than the other two baselines but still deteriorates significantly for $K \geq 64$. FedAc is most robust to infrequent synchronization and outperforms the baselines by a margin for $K \geq 64$.
Figure 4: FedAc versus baselines on the dependency on the synchronization interval $K$ for various numbers of clients $M$. For all tested $M$, FedAvg and Mb-Sgd start to deteriorate once $K$ passes $2$; Mb-Ac-Sgd is more robust to moderate $K$ than FedAvg and Mb-Sgd but deteriorates sharply once $K$ passes a threshold at around $K=32$. This is because Mb-Ac-Sgd does not have enough gradient steps for convergence when the communication is too sparse. In comparison, FedAc is more robust to infrequent communication. Dataset: a9a, $\ell_2$-regularization strength: $10^{-3}$.
Figure 5: FedAc versus baselines on the observed linear speedup w.r.t. $M$ under various synchronization intervals $K$. The results are qualitatively similar to \ref{fig:a9a:1e-3:M}. Dataset: a9a, $\ell_2$-regularization strength: $10^{-2}$.
...and 19 more figures
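The experiment described in the caption of Figure 2 can be approximated with a short simulation. The sketch below makes two assumptions that are not stated in the source: $\mathcal{N}(0, 0.01)$ is read as variance $0.01$ (standard deviation $0.1$), and the number of repetitions is reduced from 65536 to 2000 so that it runs quickly in pure Python. The function names are hypothetical.

```python
import random

def grad_F(x):
    """Derivative of F(x) = x^2 for x >= 0 and x^2 / 10 for x < 0."""
    return 2.0 * x if x >= 0 else 0.2 * x

def simulate_sgd_means(trials=2000, steps=1024, eta=1e-2, noise_std=0.1,
                       checkpoints=(128, 256, 512, 1024), seed=0):
    """Run SGD on f(x; xi) = xi * x + F(x) from x = 0; return mean iterate per checkpoint."""
    rng = random.Random(seed)
    sums = {c: 0.0 for c in checkpoints}
    for _ in range(trials):
        x = 0.0  # start at the optimum x* = 0
        for t in range(1, steps + 1):
            xi = rng.gauss(0.0, noise_std)   # assumed: std 0.1, i.e. variance 0.01
            x -= eta * (xi + grad_F(x))      # stochastic gradient of f(x; xi)
            if t in sums:
                sums[t] += x
    return {c: s / trials for c, s in sums.items()}

means = simulate_sgd_means()
for step in sorted(means):
    print(step, means[step])
```

Although the iterate starts at the optimum and the gradient noise has zero mean, the averaged iterate should come out negative, in line with the leftward drift the caption describes.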

Theorems & Definitions (229)

proposition 2.1 (thm:fedavg:2o:ub): Convergence Rate for FedAvg, adapted from Khaled, Mishchenko, et al. (AISTATS 2020), Woodworth, Patel, et al. (ICML 2020), and Woodworth, Patel, et al. (NeurIPS 2020)
remark 2.2
lemma 2.3 (lem:fedavg:2o:ub:1): Convergence of shadow trajectory up to variance term
lemma 2.4 (lem:fedavg:2o:ub:2): Bounded inter-client variance
proof: Proof of proposition 2.1
definition 2.5: Iterate Bias of SGD
theorem 2.6: Upper bound of the iterate bias under Assumption \ref{asm:fo:2o}', simplified from \ref{thm:fedavg:2o:bias:ub:complete}
theorem 2.7: Lower bound of the iterate bias under Assumption \ref{asm:fo:2o}', simplified from \ref{thm:fedavg:2o:bias:lb:complete}
theorem 2.8: Lower bound for homogeneous FedAvg
remark 2.9
...and 219 more
