Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction

Yanghao Li; Changxin Liu; Yuhao Yi

Abstract

In collaborative and distributed learning, Byzantine robustness is a central requirement for optimization algorithms. Because such algorithms typically transmit a large number of parameters, communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism that integrates recent advances in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-4})$ iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate the variance of the stochastic approximations. We show that Byz-VR-DM21 provably converges to $\varepsilon$-stationary points in $\mathcal{O}(\varepsilon^{-3})$ iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed methods.
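The abstract does not spell out the double-momentum estimator, so the following is only a minimal sketch of what one worker-side step might look like. It assumes the estimator follows the usual error-feedback pattern: two chained exponential moving averages of the stochastic gradient, followed by compressing the difference between the second average and the estimate the server already holds. The names `worker_step`, `eta`, `compress`, and the state keys `v`, `u`, `g` are hypothetical and not taken from the paper.

```python
def worker_step(state, stoch_grad, eta, compress):
    """One hypothetical double-momentum step with an error-feedback style update.

    state      : dict with lists "v" (first momentum), "u" (second momentum),
                 and "g" (the estimate mirrored on the server); assumed layout
    stoch_grad : stochastic gradient at the current iterate (list of floats)
    eta        : momentum parameter in (0, 1] (assumed name)
    compress   : any compressor mapping a list of floats to a list of floats
    """
    # First momentum: moving average of the stochastic gradients.
    v = [(1 - eta) * vi + eta * gi for vi, gi in zip(state["v"], stoch_grad)]
    # Second momentum: moving average of the first momentum.
    u = [(1 - eta) * ui + eta * vi for ui, vi in zip(state["u"], v)]
    # Only the compressed difference is sent to the server.
    message = compress([ui - gi for ui, gi in zip(u, state["g"])])
    # Worker and server both add the message to their copy of g.
    g = [gi + ci for gi, ci in zip(state["g"], message)]
    return {"v": v, "u": u, "g": g}, message


# Usage with the identity map as a trivial (lossless) compressor.
zero_state = {"v": [0.0] * 4, "u": [0.0] * 4, "g": [0.0] * 4}
new_state, msg = worker_step(zero_state, [4.0, -2.0, 1.0, 0.0], 0.5, lambda x: list(x))
```

In this sketch the server would apply a robust aggregation rule to the workers' `g` vectors instead of averaging them; that step is omitted here.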

Paper Structure

This paper contains 40 sections, 23 theorems, 155 equations, 12 figures, 5 tables, and 1 algorithm.

Introduction
Preliminaries
Byzantine-Robust Distributed Learning
Byzantine-DM21
Convergence of Byz-DM21 for General Non-Convex Problems
Convergence of Byz-DM21 under the Polyak-Łojasiewicz Condition
Incorporating Variance Reduction
Convergence of Byz-VR-DM21 for General Non-Convex Problems
Convergence of Byz-VR-DM21 under the Polyak-Łojasiewicz Condition
Numerical Experiments
Conclusion
Related Work
Intuition Behind Double Momentum
Further Details on Robust Aggregation and Byzantine Attacks
Robust Aggregation
...and 25 more sections

Key Result

Theorem 3.1

Assume that Assumptions assump:smoothness, assump:heterogeneity, and assump:bound_variance hold, and consider Algorithm alg:SGD2M for solving the distributed learning problem eq:original_P with $B < n/2$ Byzantine workers and communication compression characterized by the parameter $\alpha \in$ … where $\hat{x}^{(T)}$ is sampled uniformly at random from the iterates of the method, $\Phi^{(0)}$ …

Figures (12)

Figure 1: The training variance of honest messages under four attack scenarios on the a9a dataset.
Figure 2: The training loss of RFA and CM under four attack scenarios (SF, IPM, LF, ALIE) on the a9a dataset in a heterogeneous setting. We use $k=0.1d$ for both $\text{Rand}_k$ and $\text{Top}_k$ compressors.
Figure 3: The training loss of RFA, CM, and CWTM aggregation rules under four attack scenarios (SF, IPM, LF, ALIE) on the w8a dataset in a heterogeneous setting. BR-DIANA and Byz-VR-MARINA use the $\text{Rand}_k$ compressor with $k = 0.1d$, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the $\text{Top}_k$ compressor with $k = 0.1d$.
Figure 4: The relative error curves of three aggregation rules (RFA, CM, CWTM) under four attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. The dataset is uniformly split over 12 honest workers, with 8 Byzantine workers. BR-LSVRG, Byz-VR-MARINA, Byrd-SAGA, and Byz-VR-DM21 use batch size $b=0.01m$ and step size $\gamma=1/(2L)$.
Figure 5: The communication complexity comparison under four attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. Byz-VR-MARINA uses the $\text{Rand}_k$ compressor, while Byz-VR-DM21 uses the $\text{Top}_k$ compressor, with $k=0.1d$, batch size $b=1$, and step size $\gamma=1/(2L)$.
...and 7 more figures
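
The captions above refer to the aggregation rules CM and CWTM without defining them. Assuming these abbreviations carry their usual meaning in the Byzantine-robust literature (coordinate-wise median and coordinate-wise trimmed mean), a minimal sketch looks as follows. The function names and the trimming parameter `b` are illustrative choices, not the paper's notation.

```python
from statistics import median


def coordinate_wise_median(messages):
    """CM (assumed meaning): take the median of each coordinate across workers."""
    return [median(column) for column in zip(*messages)]


def coordinate_wise_trimmed_mean(messages, b):
    """CWTM (assumed meaning): per coordinate, drop the b smallest and b largest
    values, then average what remains."""
    n = len(messages)
    if 2 * b >= n:
        raise ValueError("need n > 2b so that some values survive trimming")
    result = []
    for column in zip(*messages):
        kept = sorted(column)[b:n - b]
        result.append(sum(kept) / len(kept))
    return result


# Four well-behaved messages and one outlier standing in for a Byzantine worker.
messages = [[1.0, 4.0], [2.0, 8.0], [3.0, 6.0], [2.5, 5.0], [100.0, -100.0]]
cm = coordinate_wise_median(messages)
cwtm = coordinate_wise_trimmed_mean(messages, b=1)
plain_mean = [sum(column) / len(column) for column in zip(*messages)]
```

In this toy input the plain mean is pulled far away by the single outlier, while both robust rules stay within the range of the honest messages.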

Theorems & Definitions (43)

Definition 2.5: $(B,\varepsilon)$-Byzantine robustness
Definition 2.6: $(B,\kappa)$-robustness
Definition 2.7: Contractive compressors
Theorem 3.1
Remark 3.2
Corollary 3.3
Corollary 3.4
Remark 3.5
Theorem 3.7
Theorem 4.1
...and 33 more
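
Definition 2.7 is listed here by name only. Assuming it is the standard notion, a compressor $\mathcal{C}$ is contractive with parameter $\alpha$ when $\mathbb{E}\|\mathcal{C}(x) - x\|^2 \le (1-\alpha)\|x\|^2$. The sketch below implements $\text{Top}_k$, which satisfies this bound with $\alpha = k/d$, and the unbiased $\text{Rand}_k$ variant scaled by $d/k$. Whether the paper uses exactly these variants is an assumption, and all function names are illustrative.

```python
import random


def top_k(x, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def rand_k(x, k, rng):
    """Keep k uniformly random entries, scaled by d/k so the output is unbiased."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [xi * d / k if i in keep else 0.0 for i, xi in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


x = [3.0, -1.0, 0.5, 2.0]
k, d = 2, len(x)

compressed = top_k(x, k)
error = sq_norm([c - xi for c, xi in zip(compressed, x)])
bound = (1 - k / d) * sq_norm(x)  # (1 - alpha) * ||x||^2 with alpha = k/d

sparse = rand_k(x, k, random.Random(0))
```

For $\text{Top}_k$ the inequality holds deterministically, because the $d-k$ discarded entries are the smallest in magnitude and so carry at most a $(1 - k/d)$ fraction of $\|x\|^2$.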
