Byzantine-Resilient Gradient Coding through Local Gradient Computations

Christoph Hofmeister, Luis Maßny, Eitan Yaakobi, Rawad Bitar

TL;DR

This work tackles exact gradient coding in distributed learning under adversarial (Byzantine) workers. It proposes the s-BGC framework, which augments gradient coding with interactive rounds and a small amount of local main-node computation to detect and prune Byzantine inputs, thereby reducing replication from $2s+1$ to $s+1$ for $s$ malicious workers. The authors derive fundamental trade-offs and lower bounds on replication, computation, and communication, and present an achievable scheme that attains these limits within a fractional repetition data layout; the scheme generalizes to a higher replication $s+u$ to balance computation and communication or to tolerate stragglers. Overall, the work provides a principled framework and practical scheme for Byzantine-resilient gradient coding with near-optimal replication and a clear path to handling additional system constraints.

Abstract

We consider gradient coding in the presence of an adversary controlling so-called malicious workers trying to corrupt the computations. Previous works propose the use of MDS codes to treat the responses from malicious workers as errors and correct them using the error-correction properties of the code. This comes at the expense of increasing the replication, i.e., the number of workers each partial gradient is computed by. In this work, we propose a way to reduce the replication to $s+1$ instead of $2s+1$ in the presence of $s$ malicious workers. Our method detects erroneous inputs from the malicious workers, transforming them into erasures. This comes at the expense of $s$ additional local computations at the main node and additional rounds of light communication between the main node and the workers. We define a general framework and give fundamental limits for fractional repetition data allocations. Our scheme is optimal in terms of replication and local computation and incurs a communication cost that is asymptotically, in the size of the dataset, a multiplicative factor away from the derived bound. We furthermore show how additional redundancy can be exploited to reduce the number of local computations and communication cost, or, alternatively, tolerate straggling workers.
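The detection idea in the abstract can be illustrated with a minimal sketch for a single group of $s+1$ workers that were all assigned the same partial gradients. This is not the paper's protocol, which uses additional light communication rounds to locate disagreements cheaply. Here the main node simply sees all responses. The function name `aggregate_group`, the `local_gradient` callback, and the use of scalar integer values for partial gradients are assumptions made for illustration. The sketch assumes the group contains at least one honest worker and that honest results are exactly reproducible.

```python
from typing import Callable, List, Sequence, Tuple


def aggregate_group(
    responses: Sequence[Sequence[int]],
    local_gradient: Callable[[int], int],
) -> Tuple[List[int], int]:
    """Return the verified partial gradients and the number of local computations.

    responses[w][i] is worker w's claimed value for partial gradient i.
    Assumes at least one worker is honest (hypothetical simplified model).
    """
    active = list(range(len(responses)))  # workers not yet exposed as malicious
    local_computations = 0
    while True:
        reference = responses[active[0]]
        # Find the first index on which the active workers disagree.
        disagreement = next(
            (
                i
                for i in range(len(reference))
                if any(responses[w][i] != reference[i] for w in active)
            ),
            None,
        )
        if disagreement is None:
            # All remaining workers agree, and an honest one is among them.
            return list(reference), local_computations
        # Resolve the disagreement with one local computation at the main node.
        truth = local_gradient(disagreement)
        local_computations += 1
        # Workers contradicting the locally computed value become erasures.
        active = [w for w in active if responses[w][disagreement] == truth]


# Hypothetical example: s = 2 malicious workers in a group of s + 1 = 3.
true_values = [3, 1, 4, 1]
claimed = [
    [3, 1, 4, 1],  # honest worker
    [3, 9, 4, 1],  # malicious: corrupts partial gradient 1
    [3, 1, 4, 7],  # malicious: corrupts partial gradient 3
]
result, used = aggregate_group(claimed, lambda i: true_values[i])
```

Every local computation removes at least one malicious worker and never an honest one, so under these assumptions the loop performs at most $s$ local computations, consistent with the cost stated in the abstract.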

Paper Structure

This paper contains 20 sections, 6 theorems, 26 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Suppose that $n = m(s + u)$ for integers $m, u \geq 1$. For any $s$-BGC scheme with parameters $(r, c, \rho, \kappa)$ that has a fractional repetition data assignment, it holds that if $\rho \leq s + u$, then $c \geq \lfloor \ldots$

Figures (6)

  • Figure 1: Illustration of a distributed gradient descent setting with adversaries.
  • Figure 2: Trade-off between local computations $c$ and (normalized) replication $\bar{\rho} = \rho/s$. For $\bar{\rho} \geq 1 + \frac{1}{s}$, points on or above the blue curve $c = \lfloor \frac{1}{\bar{\rho}-1} \rfloor$ are achievable. Points below or to the left of the curve are fundamentally impossible, including the shaded region $\bar{\rho} \leq 1$, which is unattainable for any number of local computations.
  • Figure 3: Comparison of converse and achievability for $\kappa$ over the dataset size $p$. We consider a system of $n=10$ workers, $m=1$ group, and an alphabet size $|\mathcal{A}|=2^{16}$. As the percentage of malicious workers rises from 50% to 90%, the communication overhead of the scheme and the lower bound both increase.
  • Figure 4: Example of a match tree for $W_j$ and parameters $m=1$, $p=4$.
  • Figure 5: Trade-off between total worker-to-main-node communication and the number of local computations for our scheme. The parameters are $s = 10$, $1 \leq u \leq 11$, $m=1$, $p = 10^4$, $d = 10^6$, $|\mathcal{A}|=2^{16}$. When considering total communication, the cost of the protocol is outweighed by the cost of transmitting the gradient values. The gap to our bound, given in \ref{thm:converse_commoh}, is less than 5k. For $c=0$ local computations, our scheme is equivalent to DRACO (Chen et al., 2018). By requiring fewer workers, our scheme with $u=1$ reduces communication by 48% at the expense of at most $c=10$ local gradient computations at the main node. For comparison, each worker node performs $p = 10^4$ gradient computations.
  • ...and 1 more figure
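The curve in the Figure 2 caption can be evaluated directly. The sketch below is an illustration under stated assumptions: the function name `local_computations_on_curve` is hypothetical, and the code rewrites $\lfloor \frac{1}{\bar{\rho}-1} \rfloor$ with $\bar{\rho} = \rho/s$ as the integer division $\lfloor s/(\rho - s) \rfloor$ to avoid floating-point rounding. It uses the parameters from the Figure 5 caption, $s = 10$ and $\rho = s + u$ for $1 \leq u \leq 11$.

```python
def local_computations_on_curve(s: int, rho: int) -> int:
    """Evaluate c = floor(1 / (rho/s - 1)) = floor(s / (rho - s)).

    The caption states the curve only for rho/s >= 1 + 1/s, i.e. rho >= s + 1.
    """
    if s < 1 or rho < s + 1:
        raise ValueError("curve is stated only for rho >= s + 1 and s >= 1")
    return s // (rho - s)


s = 10  # number of malicious workers, as in the Figure 5 caption
curve = {u: local_computations_on_curve(s, s + u) for u in range(1, 12)}
for u, c in curve.items():
    print(f"u={u:2d}  rho={s + u:2d}  c={c}")
```

For $u = 1$ the curve gives $c = 10$, and for $u = 11$ (replication $2s+1 = 21$) it gives $c = 0$, matching the two endpoints mentioned in the Figure 5 caption.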

Theorems & Definitions (9)

  • Theorem 1: Lower bound on $c$ and $\rho$
  • Theorem 2: Lower bound on $\kappa$ for fixed $c$ and $\rho$
  • Theorem 3
  • Corollary 1
  • Corollary 2
  • Definition 1: Byzantine-resilient gradient coding scheme
  • Definition 2
  • Corollary 3: Computation of Disagreement Gradients
  • Remark 1: Compression beyond the alphabet size