Reconciling Communication Compression and Byzantine-Robustness in Distributed Learning

Diksha Gupta, Antonio Honsell, Chuan Xu, Nirupam Gupta, Giovanni Neglia

TL;DR

This work tackles the dual challenges of Byzantine robustness and communication efficiency in distributed learning by introducing RoSDHB, a lightweight algorithm that applies classical Polyak momentum on the server after coordinated gradient sparsification with a global mask. RoSDHB achieves convergence guarantees comparable to those of the state-of-the-art Byz-DASHA-PAGE under standard $(G,B)$-gradient dissimilarity, without requiring bounded global Hessian variance, while reducing worker memory and uplink communication. The theoretical analysis employs a novel Lyapunov function to handle sparsification noise that scales with the gradient norm, and reveals a tight coupling between compression and robustness when data are heterogeneous. Empirically, RoSDHB demonstrates stronger robustness and substantial communication savings on MNIST and CIFAR-10 under the Fall of Empires (FOE) and A Little Is Enough (ALIE) attacks, outperforming the SOTA by up to several-fold in convergence time and communication cost.

Abstract

Distributed learning enables scalable model training over decentralized data, but remains hindered by Byzantine faults and high communication costs. While both challenges have been studied extensively in isolation, their interplay has received limited attention. Prior work has shown that naively combining communication compression with Byzantine-robust aggregation can severely weaken resilience to faulty nodes. The current state-of-the-art, Byz-DASHA-PAGE, leverages a momentum-based variance reduction scheme to counteract the negative effect of compression noise on Byzantine robustness. In this work, we introduce RoSDHB, a new algorithm that integrates classical Polyak momentum with a coordinated compression strategy. Theoretically, RoSDHB matches the convergence guarantees of Byz-DASHA-PAGE under the standard $(G,B)$-gradient dissimilarity model, while relying on milder assumptions and requiring less memory and communication per client. Empirically, RoSDHB demonstrates stronger robustness while achieving substantial communication savings compared to Byz-DASHA-PAGE.
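The combination described above (a sparsification mask shared by all workers, Polyak momentum kept on the server, and a robust aggregation step) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the random-$k$ mask with $d/k$ rescaling, the per-worker momentum form $m \leftarrow \beta m + (1-\beta) g$, and the use of a coordinate-wise trimmed mean as the robust aggregator are all assumptions made for illustration.

```python
import random


def global_mask(d, k, rng):
    """Pick k of the d coordinates; every worker uses this same index set."""
    return sorted(rng.sample(range(d), k))


def sparsify(grad, mask):
    """Worker side: upload only the masked coordinates (k values, not d)."""
    return [grad[j] for j in mask]


def trimmed_mean(values, f):
    """Drop the f smallest and f largest values, then average the rest."""
    ordered = sorted(values)
    kept = ordered[f:len(ordered) - f]
    return sum(kept) / len(kept)


def server_round(theta, momenta, payloads, mask, f, gamma, beta):
    """One hypothetical server step: momentum per worker, then aggregation."""
    d, k = len(theta), len(mask)
    scale = d / k  # assumption: rescale so the sparsified gradient is unbiased
    for i, payload in enumerate(payloads):
        sparse = dict(zip(mask, payload))
        # assumption: Polyak momentum written as m <- beta*m + (1-beta)*g
        momenta[i] = [
            beta * m + (1 - beta) * scale * sparse.get(j, 0.0)
            for j, m in enumerate(momenta[i])
        ]
    # assumption: coordinate-wise trimmed mean stands in for the robust rule F
    update = [trimmed_mean([m[j] for m in momenta], f) for j in range(d)]
    new_theta = [t - gamma * u for t, u in zip(theta, update)]
    return new_theta, momenta
```

In this sketch the workers keep no momentum or variance-reduction state and upload only $k$ values per round, which is one way to read the claim of lower worker memory and uplink communication. Because the mask is shared, the aggregator compares all workers on the same coordinates.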

Paper Structure

This paper contains 38 sections, 10 theorems, 131 equations, 8 figures, 7 tables.

Key Result

Theorem 1

Under Assumptions assumption:smoothness and assumption:gradientGB, Algorithm alg:rghb with an $(f,\kappa)$-robust aggregation rule $F$ such that $\kappa B^2 \leq \tfrac{1}{7}$, a learning rate $\gamma \leq \tfrac{k}{d c L}$ (with $c = 23200$), and a momentum coefficient $\beta = \sqrt{1 - 24\gamma L}$ [...] where $\mathcal{L}_\mathcal{H}^* = \min_{\theta \in \mathbb{R}^d} \mathcal{L}_\mathcal{H}(\theta)$.
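The hyperparameter conditions in Theorem 1 are plain arithmetic, so a small helper can show their scale. The sketch below only evaluates the three stated conditions ($\gamma \leq \tfrac{k}{dcL}$ with $c = 23200$, $\beta = \sqrt{1 - 24\gamma L}$, and $\kappa B^2 \leq \tfrac{1}{7}$). The function names and example values are illustrative assumptions, not taken from the paper.

```python
import math

C = 23200  # the constant c stated in Theorem 1


def max_learning_rate(k, d, L):
    """Largest learning rate allowed by gamma <= k / (d * c * L)."""
    return k / (d * C * L)


def momentum_coefficient(gamma, L):
    """Momentum coefficient beta = sqrt(1 - 24 * gamma * L)."""
    inside = 1 - 24 * gamma * L
    if inside < 0:
        raise ValueError("gamma is too large: 1 - 24*gamma*L is negative")
    return math.sqrt(inside)


def aggregation_condition_holds(kappa, B):
    """Check the robustness/heterogeneity condition kappa * B^2 <= 1/7."""
    return kappa * B ** 2 <= 1 / 7


# Hypothetical example: sampling ratio k/d = 0.1 and smoothness constant L = 1.
gamma = max_learning_rate(k=1, d=10, L=1.0)
beta = momentum_coefficient(gamma, L=1.0)
```

Since $k \leq d$, the largest admissible learning rate gives $24\gamma L \leq 24/23200$, so the prescribed $\beta$ stays above 0.999 whenever $\gamma$ sits at its upper bound. A smaller sampling ratio $k/d$ forces a smaller learning rate and pushes $\beta$ closer to 1.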

Figures (8)

  • Figure 1: Convergence plots of RoSDHB and Byz-DASHA-PAGE (SOTA) for varying numbers of Byzantine workers $f$ and sampling ratios $k/d$.
  • Figure 2: Comparison of the convergence time that RoSDHB and Byz-DASHA-PAGE (SOTA) require to reach various test-accuracy thresholds, for $f \in \{1,3\}$ Byzantine workers.
  • Figure 3: Comparison of convergence time and communication cost between RoSDHB and Byz-DASHA-PAGE (SOTA) for varying sampling ratios and numbers of Byzantine workers $f \in \{1,3\}$.
  • Figure 4: Convergence plots of RoSDHB and Byz-DASHA-PAGE (SOTA) on MNIST with $k/d=0.1$ under varying data heterogeneity.
  • Figure 5: Convergence plots of RoSDHB and Byz-DASHA-PAGE (SOTA) on MNIST under the FOE attack for varying sampling ratios $k/d \in \{0.05, 0.1, 0.3, 0.5, 1.0\}$ and numbers of Byzantine workers $f \in \{1,3,5\}$.
  • ...and 3 more figures

Theorems & Definitions (22)

  • Definition 2.1: $(f, \epsilon)$-Resilience
  • Definition 2.2: $(f,\kappa)$-Robustness
  • Theorem 1
  • Proof sketch
  • Corollary 1
  • Lemma A.1: Global $L$-smoothness without global Hessian-variance
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • ...and 12 more