Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering

Changxin Liu, Yanghao Li, Yuhao Yi, Karl H. Johansson

TL;DR

The paper addresses Byzantine robustness and communication efficiency in distributed learning with biased compression and stochastic gradients of arbitrary batch size. It introduces Byz-EF21-SGDM, a momentum-based method that combines error feedback with robust aggregation to defend against Byzantine workers under information compression. The method converges to a smaller neighborhood of the solution than prior methods, and its complexity bounds match the lower bounds for the Byzantine-free setting. The theory covers non-convex smooth losses and, under data heterogeneity, quantifies the size of the neighborhood to which the method converges. Empirically, the method is more robust and converges faster than the compared baselines on logistic regression and CNN tasks across multiple attacks and compression schemes.

Abstract

Distributed learning has become the standard approach for training large-scale machine learning models across private data silos. While distributed learning enhances privacy preservation and training efficiency, it faces critical challenges related to Byzantine robustness and communication reduction. Existing Byzantine-robust and communication-efficient methods rely on full gradient information either at every iteration or at certain iterations with a probability, and they only converge to an unnecessarily large neighborhood around the solution. Motivated by these issues, we propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method that imposes no requirements on batch size and converges to a smaller neighborhood around the optimal solution than all existing methods, aligning with the theoretical lower bound. Our key innovation is leveraging Polyak Momentum to mitigate the noise caused by both biased compressors and stochastic gradients, thus defending against Byzantine workers under information compression. We provide proof of tight complexity bounds for our algorithm in the context of non-convex smooth loss functions, demonstrating that these bounds match the lower bounds in Byzantine-free scenarios. Finally, we validate the practical significance of our algorithm through an extensive series of experiments, benchmarking its performance on both binary classification and image classification tasks.
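
The abstract names three ingredients: momentum, compression with error feedback, and robust aggregation. This page does not reproduce Algorithm 1, so the following is only a minimal sketch of how such ingredients could fit together, assuming an EF21-style update in which each worker compresses the gap between its momentum and a running estimate shared with the server. The function `run`, the toy quadratic loss, the random Byzantine messages, and all parameter values are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from statistics import median

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def run(T=400, n=5, n_byz=2, d=4, k=1, eta=0.1, gamma=0.05, seed=0):
    """Toy loop (assumed, not Algorithm 1): minimize 0.5 * ||x - target||^2."""
    rng = random.Random(seed)
    target = [1.0] * d
    x = [0.0] * d
    v = [[0.0] * d for _ in range(n)]  # worker-side momentum
    g = [[0.0] * d for _ in range(n)]  # estimate kept in sync by worker and server
    for _ in range(T):
        for i in range(n):
            if i < n_byz:
                # Byzantine worker: sends an arbitrary message (here, random).
                msg = [rng.uniform(-5, 5) for _ in range(d)]
            else:
                # Single-sample noisy gradient of the toy loss.
                grad = [xj - tj + rng.gauss(0, 0.1) for xj, tj in zip(x, target)]
                # Momentum averages out the stochastic-gradient noise.
                v[i] = [(1 - eta) * vj + eta * gj for vj, gj in zip(v[i], grad)]
                # Error feedback: compress only the gap to the current estimate.
                msg = top_k([vj - gj for vj, gj in zip(v[i], g[i])], k)
            g[i] = [gj + mj for gj, mj in zip(g[i], msg)]
        # Robust aggregation: coordinate-wise median over all estimates.
        agg = [median(col) for col in zip(*g)]
        x = [xj - gamma * aj for xj, aj in zip(x, agg)]
    return x

print(run())
```

In this toy setting each honest worker transmits a single coordinate per round, and the median keeps the two corrupted estimates from steering the update as long as honest workers form a majority.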

Paper Structure

This paper contains 26 sections, 7 theorems, 49 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the smoothness, bounded-heterogeneity, and bounded-variance assumptions hold. Consider Algorithm 1 (Byz-EF21-SGDM) applied to the distributed learning problem in the presence of $f<n/2$ Byzantine workers and communication compression with parameter $\alpha \in (0,1]$ [...]. Then [...], where $\hat{x}^{(T)}$ is sampled uniformly at random from $x^{(0)},x^{(1)},\dots,x^{(T-1)}$.

Figures (2)

  • Figure 1: The training loss of 3 aggregation rules (RFA, CWMed, CWTM) under 4 attacks (SF, IPM, LF, ALIE) on the a9a dataset. The dataset is uniformly split among 20 workers, including 9 Byzantine workers. BR-CSGD, BR-DIANA, and Byz-VR-MARINA use the $\text{Rand}_1$ compressor. Our algorithm (Byz-EF21-SGDM) uses the $\text{Top}_1$ compressor.
  • Figure 2: The testing accuracy of 3 aggregation rules (RFA, CWMed, CWTM) under 2 attacks (SF, LF) on the FEMNIST dataset. The dataset is uniformly split among 20 workers, including 9 Byzantine workers. BR-CSGD, BR-DIANA, and Byz-VR-MARINA use the $\text{Rand}_{k}$ compressor, and our algorithm (Byz-EF21-SGDM) uses the $\text{Top}_{k}$ compressor, where $k=0.1d$.
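
The captions above refer to the $\text{Top}_k$ and $\text{Rand}_k$ compressors. As a rough illustration, the sketch below implements both in plain Python. It assumes the common conventions that $\text{Top}_k$ keeps the $k$ largest-magnitude coordinates and that $\text{Rand}_k$ keeps $k$ uniformly chosen coordinates scaled by $d/k$ so that it is unbiased; the paper's exact definitions are not reproduced on this page, and the function names are hypothetical.

```python
import random

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest (biased)."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def rand_k(x, k, rng):
    """Keep k uniformly chosen coordinates, scaled by d/k (assumed unbiased form)."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [v * d / k if i in keep else 0.0 for i, v in enumerate(x)]

def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)

x = [0.5, -3.0, 0.1, 2.0, -0.2]
c = top_k(x, 2)
err = sq_norm([a - b for a, b in zip(x, c)])
# Top_k is contractive: ||C(x) - x||^2 <= (1 - k/d) * ||x||^2 for every x.
assert err <= (1 - 2 / len(x)) * sq_norm(x)
print(c, rand_k(x, 2, random.Random(0)))
```

The assertion checks the contraction inequality with $\alpha = k/d$ on one input, which is the standard sense in which $\text{Top}_k$ is a contractive compressor.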

Theorems & Definitions (15)

  • Definition 1: $(f,\varepsilon)$-Byzantine robustness
  • Definition 2: $(f,\kappa)$-robustness
  • Definition 3: Contractive compressors
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Remark 1
  • Lemma 1: Descent lemma
  • Lemma 2: Robust aggregation error
  • proof
  • ...and 5 more
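
The robustness definitions listed above concern aggregation rules such as the CWMed and CWTM rules named in the figure captions. The sketch below shows the usual textbook forms of coordinate-wise median and coordinate-wise trimmed mean, assuming the trimmed mean discards the $f$ smallest and $f$ largest values per coordinate. Function names and the example vectors are hypothetical, and the sketch makes no claim about the robustness constants in the paper's definitions.

```python
from statistics import median

def cw_median(vectors):
    """Coordinate-wise median of a list of equal-length vectors."""
    return [median(col) for col in zip(*vectors)]

def cw_trimmed_mean(vectors, f):
    """Per coordinate, drop the f smallest and f largest values, then average."""
    n = len(vectors)
    if not 0 <= 2 * f < n:
        raise ValueError("need 0 <= 2f < n")
    out = []
    for col in zip(*vectors):
        kept = sorted(col)[f:n - f]
        out.append(sum(kept) / len(kept))
    return out

honest = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
byzantine = [[100.0, -100.0], [100.0, -100.0]]  # two identical outliers
print(cw_median(honest + byzantine))
print(cw_trimmed_mean(honest + byzantine, f=2))
```

With three honest inputs and two outliers, both rules return a value taken from the honest inputs, whereas a plain average would be pulled far toward the outliers.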