Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering

Changxin Liu; Yanghao Li; Yuhao Yi; Karl H. Johansson

TL;DR

The paper addresses Byzantine robustness and communication efficiency in distributed learning with biased compression and stochastic gradients, with no requirement on batch size. It introduces Byz-EF21-SGDM, a momentum-based method that combines error feedback with robust aggregation to defend against Byzantine workers under information compression. For non-convex smooth losses, the authors prove complexity bounds that match the lower bounds of the Byzantine-free setting. Under data heterogeneity, the method converges to a smaller neighborhood of the solution than existing methods, and the size of that neighborhood is quantified. Empirically, the method shows stronger robustness and faster convergence than the baselines on logistic regression and CNN tasks across multiple attacks and compression schemes. The work thus combines biased compression with Byzantine robustness under a single set of convergence and complexity guarantees.

Abstract

Distributed learning has become the standard approach for training large-scale machine learning models across private data silos. While distributed learning enhances privacy preservation and training efficiency, it faces critical challenges related to Byzantine robustness and communication reduction. Existing Byzantine-robust and communication-efficient methods rely on full gradient information either at every iteration or at certain iterations with a probability, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirements on batch size and converges to a smaller neighborhood around the optimal solution than all existing methods, aligning with the theoretical lower bound. Our key innovation is leveraging Polyak Momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thus defending against Byzantine workers under information compression. We provide proof of tight complexity bounds for our algorithm in the context of non-convex smooth loss functions, demonstrating that these bounds match the lower bounds in Byzantine-free scenarios. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on both binary classification and image classification tasks.
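The abstract does not spell out the update rule, so the following is only a minimal sketch of how momentum and error feedback are commonly combined on an honest worker, following the general EF21-with-momentum pattern. The function name `worker_step`, its arguments, and the exact form of the update are assumptions made for illustration, not the paper's Algorithm 1.

```python
def worker_step(v, g, stoch_grad, eta, compress):
    """One illustrative honest-worker update (hypothetical helper, not from the paper).

    v          -- local momentum estimate of the gradient
    g          -- compressed-gradient estimate mirrored by the server
    stoch_grad -- a single stochastic gradient (no batch-size requirement)
    eta        -- momentum parameter in (0, 1]
    compress   -- any compressor mapping a list of floats to a list of floats
    """
    # Polyak-style momentum: average the new stochastic gradient into v.
    v_new = [(1.0 - eta) * vi + eta * si for vi, si in zip(v, stoch_grad)]
    # Error feedback: compress only the gap between v_new and the shared estimate g.
    c = compress([vn - gi for vn, gi in zip(v_new, g)])
    # Worker and server both apply the same compressed correction to g.
    g_new = [gi + ci for gi, ci in zip(g, c)]
    return v_new, g_new, c  # only c would be sent over the network


# With an identity "compressor" (no compression), g simply tracks v.
identity = lambda u: list(u)
v, g, c = worker_step([0.0, 0.0], [0.0, 0.0], [1.0, -2.0], eta=0.5, compress=identity)
print(v, g, c)
```

In this sketch the server would keep one estimate `g` per worker, add each received `c` to it, and pass the resulting estimates through a robust aggregation rule before taking a descent step.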

Paper Structure (26 sections, 7 theorems, 49 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Byzantine-robust distributed learning
Byzantine-robust learning under information compression
Main Contributions
The first Byzantine-robust stochastic distributed learning method with error feedback
New complexity results
Smaller size of the neighborhood
Problem Statement and Preliminaries
Standard Byzantine-robust methods
Communication compression
Brittleness of existing communication-efficient and Byzantine-robust solutions
Communication-Efficient and Byzantine-Robust Distributed Learning
Algorithm description
Rate of convergence
...and 11 more sections

Key Result

Theorem 1

Suppose the smoothness, bounded-heterogeneity, and bounded-variance assumptions hold. Consider Byz-EF21-SGDM applied to the distributed learning problem in the presence of $f<n/2$ Byzantine workers and communication compression with parameter $\alpha \in (0,1]$. The theorem then bounds the output $\hat{x}^{(T)}$, which is sampled uniformly at random from $x^{(0)},x^{(1)},\dots,x^{(T-1)}$; the bound itself is not reproduced in this summary.
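
For reference, a compression parameter $\alpha \in (0,1]$ is usually defined through the contractive-compressor inequality. The form below is the standard one from the error-feedback literature and is given here as an assumption about what Definition 3 states; the paper's notation may differ.

$$\mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le (1-\alpha)\,\|x\|^2 \quad \text{for all } x \in \mathbb{R}^d.$$

With $\alpha = 1$ the compressor is the identity, and $\text{Top}_k$ satisfies the inequality with $\alpha = k/d$.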

Figures (2)

Figure 1: The training loss of 3 aggregation rules (RFA, CWMed, CWTM) under 4 attacks (SF, IPM, LF, ALIE) on the a9a dataset. The dataset is uniformly split among 20 workers, including 9 Byzantine workers. BR-CSGD, BR-DIANA, and Byz-VR-MARINA use the $\text{Rand}_1$ compressor. Our algorithm (Byz-EF21-SGDM) uses the $\text{Top}_1$ compressor.
Figure 2: The testing accuracy of 3 aggregation rules (RFA, CWMed, CWTM) under 2 attacks (SF, LF) on the FEMNIST dataset. The dataset is uniformly split among 20 workers, including 9 Byzantine workers. BR-CSGD, BR-DIANA, and Byz-VR-MARINA use the $\text{Rand}_{k}$ compressor, and our algorithm (Byz-EF21-SGDM) uses the $\text{Top}_{k}$ compressor, where $k=0.1d$.
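
The two compressors named in the captions can be sketched as follows. This is a minimal illustration in plain Python, using the common textbook definitions: $\text{Top}_k$ keeps the $k$ largest-magnitude coordinates, and $\text{Rand}_k$ keeps $k$ uniformly chosen coordinates. Scaling the $\text{Rand}_k$ output by $d/k$ (the usual unbiased variant) is an assumption about how the baselines use it, and the function names are hypothetical.

```python
import random


def top_k(vec, k):
    """Keep the k largest-magnitude entries and zero the rest (biased)."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(vec)]


def rand_k(vec, k, rng=random):
    """Keep k uniformly random entries, scaled by d/k (assumed unbiased variant)."""
    d = len(vec)
    kept = set(rng.sample(range(d), k))
    return [v * d / k if i in kept else 0.0 for i, v in enumerate(vec)]


x = [0.5, -3.0, 2.0, 0.1]
print(top_k(x, 1))  # [0.0, -3.0, 0.0, 0.0]
print(rand_k(x, 1, random.Random(0)))  # one randomly chosen entry, scaled by 4
```

$\text{Top}_k$ is deterministic and biased, which is the case the error-feedback mechanism is meant to handle; $\text{Rand}_k$ trades that bias for extra variance.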

Theorems & Definitions (15)

Definition 1: $(f,\varepsilon)$-Byzantine robustness
Definition 2: $(f,\kappa)$-robustness
Definition 3: Contractive compressors
Theorem 1
Corollary 1
Corollary 2
Remark 1
Lemma 1: Descent lemma
Lemma 2: Robust aggregation error
proof
...and 5 more
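
Two of the aggregation rules named in the figure captions, CWMed and CWTM, have simple coordinate-wise definitions, sketched below in plain Python. The function names and the toy vectors are hypothetical, and the trimmed mean follows the common definition (drop the $f$ smallest and $f$ largest values in each coordinate), which may differ in detail from the paper's.

```python
from statistics import median


def cw_median(vectors):
    """Coordinate-wise median (CWMed) of a list of equal-length vectors."""
    return [median(column) for column in zip(*vectors)]


def cw_trimmed_mean(vectors, f):
    """Coordinate-wise trimmed mean (CWTM): per coordinate, drop the f smallest
    and f largest values, then average what remains."""
    n = len(vectors)
    if 2 * f >= n:
        raise ValueError("need f < n/2 so that some values survive trimming")
    result = []
    for column in zip(*vectors):
        kept = sorted(column)[f:n - f]
        result.append(sum(kept) / len(kept))
    return result


# Four honest vectors and one Byzantine outlier (toy data for illustration).
received = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [2.5, 3.5], [1000.0, -1000.0]]
print(cw_median(received))           # [2.0, 2.5]
print(cw_trimmed_mean(received, 1))  # [2.0, 2.5]
```

A plain average of the same five vectors would be dominated by the outlier, which is the failure mode robust aggregation is meant to prevent.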
