Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction

Yanghao Li, Changxin Liu, Yuhao Yi

Abstract

In collaborative and distributed learning, Byzantine robustness is a central requirement for optimization algorithms. Such algorithms typically transmit a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism that integrates recent advances in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate the variance from random approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3})$ iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed method.
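
The abstract describes a gradient estimator that combines double momentum with error feedback. The exact Byz-DM21 recursion is not reproduced on this page, so the following is only a minimal sketch of the generic pattern, under stated assumptions: two chained exponential moving averages of the stochastic gradient, followed by an EF21-style step that compresses the difference between the second momentum and the worker's current estimate. The class name `Worker`, the parameter `eta`, and the choice of a Top-k compressor are hypothetical and chosen for illustration.

```python
def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out


class Worker:
    """Hypothetical honest worker; not the paper's exact algorithm."""

    def __init__(self, d):
        self.v = [0.0] * d  # first momentum buffer
        self.u = [0.0] * d  # second momentum buffer (momentum of the momentum)
        self.g = [0.0] * d  # error-feedback estimate, assumed mirrored on the server

    def step(self, stoch_grad, eta, k):
        # First momentum: moving average of the stochastic gradient.
        self.v = [(1 - eta) * v + eta * s for v, s in zip(self.v, stoch_grad)]
        # Second momentum: moving average of the first momentum.
        self.u = [(1 - eta) * u + eta * v for u, v in zip(self.u, self.v)]
        # EF21-style step: compress only the gap between u and the estimate g.
        msg = top_k([u - g for u, g in zip(self.u, self.g)], k)
        # Worker and server both add the compressed message to their copy of g.
        self.g = [g + c for g, c in zip(self.g, msg)]
        return msg


# Toy run with a fixed "gradient" (an assumption, to keep the demo deterministic).
worker = Worker(d=4)
for _ in range(50):
    message = worker.step([1.0, -2.0, 0.5, 0.0], eta=0.5, k=1)
```

In this sketch only the sparse `message` would be transmitted; a server would keep one estimate per worker and pass those estimates through a robust aggregation rule before taking a step.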

Paper Structure (40 sections, 23 theorems, 155 equations, 12 figures, 5 tables, 1 algorithm)

Key Result

Theorem 3.1

Suppose that Assumptions assump:smoothness, assump:heterogeneity, and assump:bound_variance hold, and consider Algorithm alg:SGD2M for solving the distributed learning problem eq:original_P with $B < n/2$ Byzantine workers and communication compression characterized by the parameter $\alpha \in$ … where $\hat{x}^{(T)}$ is sampled uniformly at random from the iterates of the method, $\Phi^{(0)}$ …

Figures (12)

  • Figure 1: The training variance of honest messages under four attack scenarios on the a9a dataset.
  • Figure 2: The training loss of RFA and CM under four attack scenarios (SF, IPM, LF, ALIE) on the a9a dataset in a heterogeneous setting. We use $k=0.1d$ for both $\text{Rand}_k$ and $\text{Top}_k$ compressors.
  • Figure 3: The training loss of RFA, CM, and CWTM aggregation rules under four attack scenarios (SF, IPM, LF, ALIE) on the w8a dataset in a heterogeneous setting. BR-DIANA and Byz-VR-MARINA use the $\text{Rand}_k$ compressor with $k = 0.1d$, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the $\text{Top}_k$ compressor with $k = 0.1d$.
  • Figure 4: The relative error curves of three aggregation rules (RFA, CM, CWTM) under four attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. The dataset is split uniformly over 12 honest workers, alongside 8 Byzantine workers. BR-LSVRG, Byz-VR-MARINA, Byrd-SAGA, and Byz-VR-DM21 use batch size $b=0.01m$ and step size $\gamma=1/2L$.
  • Figure 5: The communication complexity comparison under 4 attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. Byz-VR-MARINA uses the $\text{Rand}_k$ compressor, while Byz-VR-DM21 uses the $\text{Top}_k$ compressor, with $k=0.1d$, batch size $b=1$, and step size $\gamma=1/2L$.
  • ...and 7 more figures
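
The captions above refer to $\text{Top}_k$ and $\text{Rand}_k$ compressors with $k = 0.1d$. As a rough illustration, the sketch below implements both in plain Python. The $d/k$ rescaling in `rand_k` is the common unbiased variant; whether the experiments use that scaling is an assumption, and the function names and the toy dimension `d = 20` are hypothetical.

```python
import random


def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out


def rand_k(x, k, rng):
    """Keep k uniformly random coordinates, rescaled by d/k (unbiased variant)."""
    d = len(x)
    out = [0.0] * d
    for i in rng.sample(range(d), k):
        out[i] = x[i] * d / k
    return out


def sq_norm(x):
    return sum(v * v for v in x)


rng = random.Random(0)
d = 20
k = d // 10  # k = 0.1 d, as in the captions
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

sparse_top = top_k(x, k)
sparse_rand = rand_k(x, k, rng)

# Top_k is contractive with alpha = k/d: ||C(x) - x||^2 <= (1 - k/d) ||x||^2.
top_error = sq_norm([a - b for a, b in zip(x, sparse_top)])
contraction_holds = top_error <= (1 - k / d) * sq_norm(x) + 1e-12
```

Both compressors send only `k` of the `d` coordinates; $\text{Top}_k$ is deterministic and biased, while the rescaled $\text{Rand}_k$ is randomized and unbiased in expectation.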

Theorems & Definitions (43)

  • Definition 2.5: ($B,\varepsilon$)-Byzantine robustness
  • Definition 2.6: $(B,\kappa)$-robustness
  • Definition 2.7: Contractive compressors
  • Theorem 3.1
  • Remark 3.2
  • Corollary 3.3
  • Corollary 3.4
  • Remark 3.5
  • Theorem 3.7
  • Theorem 4.1
  • ...and 33 more
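
Two of the aggregation rules named in the figure captions, coordinate-wise median (CM) and coordinate-wise trimmed mean (CWTM), can be sketched in a few lines. This follows the textbook definitions of the two rules; any extra preprocessing the paper may apply before aggregation is not shown, and the function names and toy messages are hypothetical.

```python
from statistics import median


def coordinate_wise_median(vectors):
    """CM: take the median of each coordinate across all received vectors."""
    return [median(col) for col in zip(*vectors)]


def coordinate_wise_trimmed_mean(vectors, b):
    """CWTM: per coordinate, drop the b smallest and b largest values, then average."""
    n = len(vectors)
    if 2 * b >= n:
        raise ValueError("trimming requires b < n/2")
    out = []
    for col in zip(*vectors):
        kept = sorted(col)[b:n - b]
        out.append(sum(kept) / len(kept))
    return out


# Three honest messages near (1, 2) and one adversarial outlier (toy data).
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
byzantine = [[100.0, -100.0]]
messages = honest + byzantine

cm = coordinate_wise_median(messages)
cwtm = coordinate_wise_trimmed_mean(messages, b=1)
```

With fewer than half of the messages corrupted, both outputs stay within the per-coordinate range of the honest messages, which is the intuition behind the robustness definitions listed above.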