Interactive Byzantine-Resilient Gradient Coding for General Data Assignments
Shreyas Jain, Luis Maßny, Christoph Hofmeister, Eitan Yaakobi, Rawad Bitar
TL;DR
The paper addresses exact gradient recovery in distributed learning under Byzantine adversaries by extending the Byzantine-resilient gradient coding framework to general regular data assignments. It introduces an interactive, group-based protocol that combines MDS-encoded gradient sums with a systematic elimination tournament to identify malicious workers, achieving exact recovery with replication factor $\rho = s+u$ in at most $s+1-u$ rounds, where $s$ is the number of malicious workers; for $u=1$ this gives the replication factor $s+1$ and the bound of $s$ rounds stated in the abstract. The key contributions are a generalized construction of encoding and decoding matrices from Vandermonde/MDS codes, a scalable grouping strategy with provable optimality, and a total communication overhead that is logarithmic in the number of data samples $p$. The approach extends beyond fractional repetition assignments, enabling exact gradient recovery with reduced replication, and provides a foundation for future work on fundamental limits for general data assignments.
Abstract
We tackle the problem of Byzantine errors in distributed gradient descent within the Byzantine-resilient gradient coding framework. Our proposed solution can recover the exact full gradient in the presence of $s$ malicious workers with a data replication factor of only $s+1$. It generalizes previous solutions to any data assignment scheme that has a regular replication over all data samples. The scheme detects malicious workers through additional interactive communication and a small number of local computations at the main node, leveraging group-wise comparisons between workers with a provably optimal grouping strategy. The scheme requires at most $s$ interactive rounds that incur a total communication cost logarithmic in the number of data samples.
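The role of the replication factor $s+1$ can be illustrated with a small sketch. It is not the paper's construction: the cyclic assignment, the function names, and the parameters `n_workers`, `replication`, and `n_parts` are assumptions chosen for illustration, with the cyclic assignment serving as one simple example of a data assignment with regular replication. The check confirms by brute force that with $s+1$ copies of every data partition, each partition is still held by at least one honest worker regardless of which $s$ workers are malicious, while with only $s$ copies this can fail. Identifying which copies are honest is the part the interactive protocol addresses, and it is not modeled here.

```python
from itertools import combinations


def cyclic_assignment(n_workers, replication):
    """Hypothetical regular assignment: worker w holds partitions
    w, w+1, ..., w+replication-1 (mod n_workers)."""
    return {
        w: {(w + j) % n_workers for j in range(replication)}
        for w in range(n_workers)
    }


def replication_counts(assignment, n_parts):
    """Count how many workers hold each partition."""
    counts = [0] * n_parts
    for parts in assignment.values():
        for p in parts:
            counts[p] += 1
    return counts


def every_partition_has_honest_copy(assignment, n_parts, s):
    """Brute-force check over all sets of s malicious workers:
    does every partition remain with at least one honest worker?"""
    workers = list(assignment)
    for bad in combinations(workers, s):
        honest = set(workers) - set(bad)
        covered = set().union(*(assignment[w] for w in honest))
        if len(covered) < n_parts:
            return False
    return True


if __name__ == "__main__":
    n, s = 6, 2
    with_s_plus_1 = cyclic_assignment(n, s + 1)
    with_s = cyclic_assignment(n, s)
    print(replication_counts(with_s_plus_1, n))
    print(every_partition_has_honest_copy(with_s_plus_1, n, s))
    print(every_partition_has_honest_copy(with_s, n, s))
```

The brute-force enumeration is only practical for small toy sizes; it is meant to make the counting argument concrete, not to scale.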
