Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Kaoru Otsuka, Yuki Takezawa, Makoto Yamada

TL;DR

The paper addresses Byzantine-robust federated learning under partial participation by introducing Delayed Momentum Aggregation (DeMoA). The core idea is to aggregate fresh momentum from sampled clients together with cached momentum from non-sampled clients, so that Byzantine clients remain a minority of the aggregated inputs in every round. The authors provide a convergence guarantee under standard assumptions, show that DeMoA remains robust under partial participation, and report the best accuracy among the compared methods under multiple Byzantine attacks on image datasets (MNIST and CIFAR-10). The results point to DeMoA’s practical value for scalable, robust FL in real-world networks with intermittent client participation. Under overparameterization, the method achieves a stronger convergence guarantee, mitigating the non-vanishing error terms caused by Byzantine behavior and data heterogeneity.
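The aggregation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy values, and the use of a coordinate-wise median as the robust aggregator are all assumptions made for illustration, and the client-side momentum update (stepsize $\eta$, momentum $\alpha_t$) is not reproduced here.

```python
import numpy as np

def coordinate_median(vectors):
    # Stand-in robust aggregator (assumption): coordinate-wise median.
    return np.median(vectors, axis=0)

def delayed_momentum_round(cache, fresh, aggregate=coordinate_median):
    """One hypothetical server round.

    cache: array of shape (n_clients, dim) holding the last momentum
           received from every client.
    fresh: dict mapping client id -> momentum sent by a client sampled
           in this round.
    Returns the updated cache and the aggregate over ALL clients.
    """
    cache = cache.copy()
    for client_id, momentum in fresh.items():
        cache[client_id] = momentum  # overwrite stale entries with fresh ones
    return cache, aggregate(cache)

# Toy setting: 10 clients, clients 0 and 1 are Byzantine (20%).
n_clients, dim = 10, 3
cache = np.ones((n_clients, dim))  # honest cached momenta are all ones

# This round samples clients 0, 1, 2: Byzantine clients are a majority
# of the sampled set, but still a minority of the full cache.
fresh = {0: np.full(dim, 100.0), 1: np.full(dim, 100.0), 2: np.ones(dim)}

cache, delayed_agg = delayed_momentum_round(cache, fresh)
fresh_only_agg = coordinate_median(np.stack(list(fresh.values())))
```

In this toy round, aggregating only the fresh vectors is dominated by the two Byzantine inputs, while aggregating over the full cache is not, because eight of the ten cached entries are honest.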

Abstract

Partial participation is essential for communication-efficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. In the partial participation setting, a majority of the sampled clients may be Byzantine, and once Byzantine clients dominate, existing methods break down immediately. We introduce delayed momentum aggregation, a principle where the central server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients. This principle ensures that Byzantine clients remain a minority from the server's perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA. We analyze the convergence rate of DeMoA, showing that it is Byzantine-robust under partial participation. Experiments show that, with a 20% Byzantine ratio and only a 10% participation rate, DeMoA achieves the best accuracy even when existing methods fail empirically.
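To see why a Byzantine majority among the sampled clients is a realistic concern, the probability can be computed directly. The sketch below assumes a fixed-size sample drawn uniformly without replacement and independent sampling across rounds; the abstract does not state the sampling scheme or the total number of clients, so the function names and the example numbers (100 clients, 1000 rounds) are hypothetical.

```python
from math import comb

def prob_byzantine_majority(n_clients, n_byzantine, n_sampled):
    """Probability that strictly more than half of the sampled clients
    are Byzantine (hypergeometric tail), assuming uniform sampling
    without replacement."""
    total = comb(n_clients, n_sampled)
    threshold = n_sampled // 2 + 1  # smallest count that is a strict majority
    favorable = sum(
        comb(n_byzantine, k) * comb(n_clients - n_byzantine, n_sampled - k)
        for k in range(threshold, min(n_byzantine, n_sampled) + 1)
    )
    return favorable / total

def prob_at_least_one_bad_round(p_round, n_rounds):
    """Probability of at least one Byzantine-majority round,
    assuming rounds are sampled independently."""
    return 1.0 - (1.0 - p_round) ** n_rounds

# Hypothetical setting: 100 clients, 20% Byzantine, 10% sampled per round.
p_round = prob_byzantine_majority(n_clients=100, n_byzantine=20, n_sampled=10)
p_any = prob_at_least_one_bad_round(p_round, n_rounds=1000)
print(f"per-round: {p_round:.4g}, over 1000 rounds: {p_any:.4g}")
```

Even when the per-round probability is small, it compounds over many rounds, which is consistent with the observation that methods relying only on the sampled set eventually hit a Byzantine-majority round. Ties (exactly half Byzantine) are not counted here; counting them would only raise the probability.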

Paper Structure

This paper contains 37 sections, 13 theorems, 77 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose Assumptions assmp:smoothness, assmp:variance, and assmp:hetero hold. Let the stepsize $\eta$ and the momentum $\alpha_t$ for $t \geq 2$ be [...], with $\alpha_1 = p_1 = 1$. Also, let the number of honest clients $G$ and the Byzantine ratio $\delta$ satisfy the condition [...]. Then the iterates $\{\bm x^t\}_{t=0}^{T-1}$ generated by DeMoA satisfy [...], where the expectation is taken over all sources of randomness i...

Figures (9)

  • Figure 1: Byzantine-robust training with Byzantine ratio $\delta=0.2$ using centered clipping (CCLIP). Top: IID data; bottom: non-IID data; columns correspond to different attacks. FedAvg and FedCM eventually encounter a Byzantine-majority round and collapse, while Byz-VR-MARINA-PP remains stable but attains lower test accuracy than DeMoA due to its bias from clipping.
  • Figure 2: Training ResNet-18 on CIFAR-10 without Byzantine clients ($\delta=0$) under partial participation rate $p=0.1$ using naive averaging (avg). Top: training loss (lower is better); bottom: test accuracy (higher is better). Left: IID; right: non-IID. FedCM follows trajectories similar to those of FedAvg because the gains from explicit momentum vanish: the implicit momentum effect largely dominates the updates. In contrast, DeMoA mitigates this effect via delayed momentum aggregation, which remains effective even without Byzantine clients.
  • Figure 3: Training ConvNet on MNIST with no Byzantine clients ($\delta=0$) using naive averaging (avg) aggregator. Each subplot shows IID partition (left panel) and non-IID partition (right panel).
  • Figure 4: Training ResNet-18 on CIFAR-10 with no Byzantine clients ($\delta=0$) using naive averaging (avg) aggregator. Each subplot shows IID partition (left panel) and non-IID partition (right panel).
  • Figure 5: Training with Byzantine ratio $\delta=0.2$ using centered clipping (cp) aggregator. The top row shows IID splits and the bottom row shows non-IID splits with bucketing size $s=2$; from left to right, the columns correspond to ALIE, Bit-Flipping (BF), IPM, Label-Flipping (LF), and Mimic attacks.
  • ...and 4 more figures

Theorems & Definitions (23)

  • Definition 2.4: $(\delta,c)$-Robust Aggregator (Gorbunov et al., 2023)
  • Theorem 3.1
  • Corollary 3.2: Convergence under Overparameterization
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3: Young's inequality
  • Lemma 5.1: Descent Lemma
  • proof
  • Lemma 5.2: Aggregation Error
  • Proposition 5.3: Local variance term
  • ...and 13 more