
Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding

Chengxi Li, Youssef Allouah, Rachid Guerraoui, Mikael Skoglund, Ming Xiao

Abstract

In this paper, we study the problem of distributed training (DT) under Byzantine attacks and communication constraints. While prior work has developed various robust aggregation rules at the server to enhance robustness to Byzantine attacks, existing methods suffer from a critical limitation: the solution error does not diminish when the local gradients sent by different devices vary considerably, as happens under data heterogeneity among the subsets held by different devices. To overcome this limitation, we propose a novel DT method, cyclic gradient coding-based DT (LAD). In LAD, the server allocates the entire training dataset to the devices before training begins. In each iteration, it assigns computational tasks redundantly to the devices using cyclic gradient coding. Each honest device then computes local gradients on a fixed number of data subsets and encodes them before transmitting to the server. The server aggregates the coded vectors from the honest devices and the potentially incorrect messages from Byzantine devices using a robust aggregation rule. Leveraging the redundancy of computation across devices, we analytically characterize the convergence of LAD, demonstrating improved robustness against Byzantine attacks and a significantly lower solution error. Furthermore, we extend LAD to a communication-efficient variant, compressive and cyclic gradient coding-based DT (Com-LAD), which further reduces communication overhead under constrained settings. Numerical results validate the effectiveness of the proposed methods in enhancing both Byzantine resilience and communication efficiency.
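To make the workflow described in the abstract concrete, here is a minimal sketch of a single LAD-style iteration, assuming $N$ devices and $N$ data subsets, a cyclic assignment of $d$ subsets per device, a plain sum as the local encoding, and coordinate-wise median as a stand-in for the (unspecified) robust aggregation rule. The function names (`cyclic_assignment`, `local_coded_gradient`, `robust_aggregate`) are illustrative, not the paper's API.

```python
import numpy as np

def cyclic_assignment(num_subsets: int, d: int) -> np.ndarray:
    """0/1 assignment matrix with exactly d ones per row: device i covers subsets i, ..., i+d-1 (mod N)."""
    A = np.zeros((num_subsets, num_subsets), dtype=int)
    for i in range(num_subsets):
        A[i, (i + np.arange(d)) % num_subsets] = 1
    return A

def local_coded_gradient(assign_row: np.ndarray, subset_grads: list) -> np.ndarray:
    """Honest device: encode its assigned subset gradients (here, a plain sum as a simple linear code)."""
    return sum(g for a, g in zip(assign_row, subset_grads) if a == 1)

def robust_aggregate(messages: list) -> np.ndarray:
    """Server: stand-in robust rule (coordinate-wise median) over the received vectors."""
    return np.median(np.stack(messages), axis=0)

# Toy run: 6 devices, redundancy d = 3, 5-dimensional gradients, one Byzantine device.
rng = np.random.default_rng(0)
N, d, dim = 6, 3, 5
subset_grads = [rng.normal(size=dim) for _ in range(N)]   # per-subset local gradients
A = cyclic_assignment(N, d)
messages = [local_coded_gradient(A[i], subset_grads) for i in range(N)]
messages[2] = rng.normal(scale=100.0, size=dim)           # Byzantine device sends an arbitrary vector
update_direction = robust_aggregate(messages)             # robust aggregation at the server
print(update_direction)
```

The sketch only mirrors the message flow; in the paper, the local encoding and the server-side aggregation are designed jointly with the cyclic redundancy, which is what drives the improved error guarantees.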

Paper Structure

This paper contains 23 sections, 7 theorems, 70 equations, 6 figures, and 2 algorithms.

Key Result

Lemma 1

Suppose the set $\mathcal{S}$ contains all matrices in which each row has exactly $d$ entries equal to one and all other entries are zero, and let $\mathbf{h} \in \{0, 1\}^{1 \times N}$ be a random vector in which exactly $H$ elements, chosen uniformly at random, are equal to one and the remaining elements are zero.
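As a small illustration, under assumptions, of the objects appearing in Lemma 1, the snippet below builds one member of $\mathcal{S}$ (the cyclic instance with exactly $d$ ones per row) and samples a vector $\mathbf{h}$ with exactly $H$ uniformly placed ones; it does not reproduce the lemma's statement itself, and `random_h` is a hypothetical helper name.

```python
import numpy as np

def random_h(N: int, H: int, rng: np.random.Generator) -> np.ndarray:
    """1 x N vector with exactly H ones at uniformly chosen positions, zeros elsewhere."""
    h = np.zeros(N, dtype=int)
    h[rng.choice(N, size=H, replace=False)] = 1
    return h

rng = np.random.default_rng(1)
N, d, H = 6, 3, 4
A = np.zeros((N, N), dtype=int)
for i in range(N):
    A[i, (i + np.arange(d)) % N] = 1   # each row has exactly d entries equal to one

h = random_h(N, H, rng)
print(A.sum(axis=1))                   # every row sums to d
print(h.sum())                         # exactly H ones
```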

Figures (6)

  • Figure 1: The implementation of a single iteration of LAD.
  • Figure 2: The error term as a function of $\delta$.
  • Figure 3: The error term as a function of $d$.
  • Figure 4: The training loss as a function of the number of iterations for different methods.
  • Figure 5: The training loss as a function of the number of iterations for different methods under different values of $\sigma_H$. (a) $\sigma_H = 0$. (b) $\sigma_H = 0.1$.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Remark 1
  • Definition 1: $\kappa$-robustness
  • Definition 2: Unbiased compression functions
  • Lemma 1
  • Proof
  • Lemma 2
  • Proof
  • Lemma 3
  • Proof
  • Corollary 1
  • ...and 7 more