Momentum Approximation in Asynchronous Private Federated Learning

Tao Yu; Congzheng Song; Jianyu Wang; Mona Chitnis

TL;DR

This work tackles the challenge of combining momentum with asynchronous federated learning by identifying an implicit momentum bias caused by stale updates in AsyncFL. It introduces momentum approximation (MA), an online least-squares weighting scheme that makes the effective history weights approximate the synchronous momentum, thereby recovering acceleration without extensive hyperparameter tuning. The authors demonstrate, on large-scale benchmarks with and without differential privacy, that MA and its light-weight variant achieve substantial convergence speedups (up to 4x) and notable utility gains (3–20%), while remaining compatible with secure aggregation and DP. The method is simple to implement in production FL systems and reduces the need for extensive momentum tuning across tasks, improving scalability and privacy-preserving performance in asynchronous settings.

Abstract

Asynchronous protocols have been shown to improve the scalability of federated learning (FL) with a massive number of clients. Meanwhile, momentum-based methods can achieve the best model quality in synchronous FL. However, naively applying momentum in asynchronous FL algorithms leads to slower convergence and degraded model performance. It is still unclear how to effectively combine these two techniques to achieve a win-win. In this paper, we find that asynchrony introduces implicit bias to momentum updates. To address this problem, we propose momentum approximation, which minimizes the bias by finding an optimal weighted average of all historical model updates. Momentum approximation is compatible with secure aggregation as well as differential privacy, and can be easily integrated into production FL systems with a minor communication and storage cost. We empirically demonstrate that on benchmark FL datasets, momentum approximation can achieve a $1.15 \textrm{--}4\times$ speedup in convergence compared to naively combining asynchronous FL with momentum.

Paper Structure (28 sections, 7 theorems, 41 equations, 6 figures, 2 tables, 2 algorithms)

Introduction
Background
Applying Momentum to Asynchronous FL
Implicit Momentum Bias
Proposed Method: Momentum Approximation
Experiments
Baselines
Results
Related Work
Conclusion
Additional Experiments Details
Experimental Setup
Client delay distribution.
Staleness scaling and bounding.
Hyperparameters.
...and 13 more sections

Key Result

Proposition 3.5

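Suppose the server model $\bm{\theta}$ is updated using the momentum method as follows: $\bm{m}_t = \beta \bm{m}_{t-1} + (1-\beta) \bm{r}_t$ and $\bm{\theta}_{t+1} = \bm{\theta}_t - \eta \bm{m}_t$, with $\bm{m}_0 = \bm{0}$. This update rule is equivalent to $\bm{\theta}_{t+1} = \bm{\theta}_t - \eta (1-\beta) \sum_{s=1}^{t} \beta^{t-s} \bm{r}_s$. The final model after a total of $T$ iterations can be written as $\bm{\theta}_{T+1} = \bm{\theta}_1 - \eta \sum_{t=1}^{T} \sum_{s=1}^{T} {\bm{M}}_{[t,s]} \bm{r}_s$, where ${\bm{M}} \in \mathbb{R}^{T \times T}$ is a lower-triangular matrix defined as ${\bm{M}}_{[t,s]} = (1-\beta) \beta^{t-s}$ if $s \le t$ and ${\bm{M}}_{[t,s]} = 0$ otherwise.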
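The equivalence in Proposition 3.5 can be checked numerically. The sketch below is illustrative only: it assumes the standard exponential-moving-average momentum recursion with zero initial momentum, and a matrix with entries $(1-\beta)\beta^{t-s}$ on and below the diagonal, which is what the summed form $\eta (1-\beta) \sum_{s=1}^{t} \beta^{t-s} \bm{r}_s$ implies. The function names, the random updates, and the 0-based indexing are hypothetical choices made for this example, not taken from the paper.

```python
import numpy as np


def momentum_matrix(T, beta):
    """Lower-triangular matrix with entries (1 - beta) * beta**(t - s) for s <= t."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]  # diff[t, s] = t - s
    return np.where(diff >= 0, (1 - beta) * beta ** np.clip(diff, 0, None), 0.0)


def run_recursive(theta_init, R, eta, beta):
    """Apply the momentum recursion step by step (assumed form, zero initial momentum)."""
    theta = theta_init.copy()
    m = np.zeros_like(theta_init)
    for r in R:  # R[t] is the aggregated update at iteration t
        m = beta * m + (1 - beta) * r
        theta = theta - eta * m
    return theta


def run_matrix_form(theta_init, R, eta, beta):
    """Compute the final model in one shot from the momentum matrix."""
    M = momentum_matrix(R.shape[0], beta)
    # Row t of M @ R is the momentum buffer at iteration t; sum over iterations.
    return theta_init - eta * (M @ R).sum(axis=0)


rng = np.random.default_rng(0)
T, d = 15, 4
R = rng.normal(size=(T, d))  # synthetic stand-ins for model updates
theta_init = rng.normal(size=d)
print(np.allclose(run_recursive(theta_init, R, 0.1, 0.9),
                  run_matrix_form(theta_init, R, 0.1, 0.9)))
```

Writing the whole trajectory as a single matrix product is what makes it possible to compare a desired weighting of historical updates against the weighting that asynchrony actually produces.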

Figures (6)

Figure 1: (Left and Middle) In SyncFL, FedAvgM and FedAdam with momentum parameter $\beta=0.9$ converge fastest, while this is not the case in AsyncFL: no momentum ($\beta=0$) or a smaller $\beta=0.5$ is better. (Right) The parameter $\beta^\prime$ for the second moments in FedAdam, on the other hand, has a consistent impact on SyncFL and AsyncFL, i.e., a larger $\beta^\prime=0.99$ is better.
Figure 2: Visualization of the desired momentum matrix ${\bm{M}}$ ($\beta=0.9$), the implicit momentum matrix ${\bm{M}}{\bm{W}}$, the approximated momentum matrix ${\bm{A}}{\bm{W}}$, the staleness coefficient matrix ${\bm{W}}$, and the solved weighting matrix ${\bm{A}}$ in momentum approximation.
Figure 3: Comparison between MA, light-weight MA (MA-L) and baseline approaches.
Figure 4: Impact of $\beta$ on SyncFL and AsyncFL with MA on the StackOverflow dataset.
Figure 5: Comparison between MA, light-weight MA (MA-L) and baseline approaches with DP.
...and 1 more figure
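The matrices named in the Figure 2 caption can be related with a small numerical sketch. It assumes that momentum approximation chooses a lower-triangular ${\bm{A}}$ so that ${\bm{A}}{\bm{W}}$ is close to ${\bm{M}}$ in the least-squares sense, solved one row at a time so that each row only uses rows of ${\bm{W}}$ available so far. That reading is inferred from the "online least-squares" description and the caption, and may differ from the authors' exact formulation. The staleness matrix here is synthetic (random weights over the few most recent model versions, with every update at least one step stale), and all function names are hypothetical.

```python
import numpy as np


def momentum_matrix(T, beta):
    """Desired momentum matrix: (1 - beta) * beta**(t - s) for s <= t, else 0."""
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, (1 - beta) * beta ** np.clip(diff, 0, None), 0.0)


def synthetic_staleness_matrix(T, max_staleness, rng):
    """Hypothetical W: row t spreads weight over stale model versions s < t."""
    W = np.zeros((T, T))
    W[0, 0] = 1.0  # the first iteration can only use the initial model
    for t in range(1, T):
        lo = max(0, t - max_staleness)
        w = rng.random(t - lo) + 1e-3
        W[t, lo:t] = w / w.sum()  # each row sums to 1
    return W


def approximate_momentum(W, M):
    """Row-wise least squares: A[t, :t+1] minimizes ||A[t, :t+1] @ W[:t+1] - M[t]||."""
    T = M.shape[0]
    A = np.zeros_like(M)
    for t in range(T):
        a, *_ = np.linalg.lstsq(W[: t + 1].T, M[t], rcond=None)
        A[t, : t + 1] = a
    return A


rng = np.random.default_rng(0)
T = 20
M = momentum_matrix(T, beta=0.9)
W = synthetic_staleness_matrix(T, max_staleness=4, rng=rng)
A = approximate_momentum(W, M)

# Compare the implicit momentum M @ W with the approximation A @ W.
print("implicit error:    ", np.linalg.norm(M @ W - M))
print("approximated error:", np.linalg.norm(A @ W - M))
```

Because ${\bm{A}} = {\bm{M}}$ is itself a feasible choice under the lower-triangular constraint, the row-wise least-squares solution can never fit ${\bm{M}}$ worse than the implicit momentum ${\bm{M}}{\bm{W}}$ does in this toy setup.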

Theorems & Definitions (12)

Definition 2.1: Differential Privacy (Dwork et al., 2006)
Proposition 3.5
Theorem 3.6
Theorem 3.7
Lemma B.1
proof
Lemma B.2
proof
Theorem B.3
proof
...and 2 more
