Byzantine-Resilient Gradient Coding through Local Gradient Computations
Christoph Hofmeister, Luis Maßny, Eitan Yaakobi, Rawad Bitar
TL;DR
This work tackles exact gradient coding in distributed learning under adversarial (Byzantine) workers. It proposes the s-BGC framework, which augments gradient coding with interactive rounds and a small amount of local computation at the main node to detect Byzantine inputs and treat them as erasures, thereby reducing the replication required for $s$ malicious workers from $2s+1$ to $s+1$. The authors derive fundamental trade-offs and lower bounds on replication, computation, and communication for fractional repetition data allocations, and present a scheme that is optimal in replication and local computation, with a communication cost that is asymptotically within a multiplicative factor of the derived bound. The scheme generalizes to a higher replication $s+u$, which can be used either to reduce local computation and communication or to tolerate stragglers.
Abstract
We consider gradient coding in the presence of an adversary controlling so-called malicious workers trying to corrupt the computations. Previous works propose the use of MDS codes to treat the responses from malicious workers as errors and correct them using the error-correction properties of the code. This comes at the expense of increasing the replication, i.e., the number of workers each partial gradient is computed by. In this work, we propose a way to reduce the replication to $s+1$ instead of $2s+1$ in the presence of $s$ malicious workers. Our method detects erroneous inputs from the malicious workers, transforming them into erasures. This comes at the expense of $s$ additional local computations at the main node and additional rounds of light communication between the main node and the workers. We define a general framework and give fundamental limits for fractional repetition data allocations. Our scheme is optimal in terms of replication and local computation and incurs a communication cost that is asymptotically, in the size of the dataset, a multiplicative factor away from the derived bound. We furthermore show how additional redundancy can be exploited to reduce the number of local computations and communication cost, or, alternatively, tolerate straggling workers.
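The idea of turning errors into erasures with replication $s+1$ can be illustrated with a toy sketch. This is not the authors' protocol: it is a simplified illustration that assumes scalar partial gradients, groups of $s+1$ workers that each hold the same data partition, deterministic honest workers (so honest replies are bit-identical), and a main node that recomputes a whole partial gradient locally whenever a group disagrees. The names `aggregate_with_detection` and `local_compute` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def aggregate_with_detection(
    responses: Sequence[Sequence[float]],
    local_compute: Callable[[int], float],
) -> Tuple[float, int]:
    """Sum partial gradients, recomputing locally only where replicas disagree.

    responses[g] holds the s+1 replies for the partial gradient of group g.
    local_compute(g) is an assumed oracle: the main node computing group g itself.
    """
    total = 0.0
    local_computations = 0
    for g, replies in enumerate(responses):
        if all(r == replies[0] for r in replies):
            # With s+1 replicas and at most s malicious workers, at least one
            # reply is honest, so a unanimous group must be correct.
            total += replies[0]
        else:
            # Disagreement exposes at least one malicious worker in this group;
            # the main node falls back to its own computation.
            total += local_compute(g)
            local_computations += 1
    return total, local_computations


# Toy instance with s = 1: three groups of two workers each.
true_partials = [1.5, -2.0, 4.0]
responses = [[1.5, 1.5], [-2.0, 7.0], [4.0, 4.0]]  # one worker in group 1 lies
total, n_local = aggregate_with_detection(responses, lambda g: true_partials[g])
print(total, n_local)
```

In this toy setting every disagreeing group contains at least one malicious worker, so the main node performs at most $s$ local computations, in line with the count stated in the abstract. The sketch omits the interactive communication rounds that the abstract mentions.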
