Interactive Byzantine-Resilient Gradient Coding for General Data Assignments

Shreyas Jain; Luis Maßny; Christoph Hofmeister; Eitan Yaakobi; Rawad Bitar

TL;DR

The paper tackles exact gradient recovery in distributed learning under Byzantine adversaries by extending the gradient coding framework to general regular data assignments. It introduces an interactive, group-based protocol that combines MDS-encoded gradient sums with a systematic elimination tournament to identify malicious workers. For $s$ malicious workers and a parameter $u$ with $1 \leq u \leq s+1$, the scheme recovers the exact gradient with replication $\rho = s+u$ and at most $s+1-u$ rounds. The key contributions are a generalized construction of encoding and decoding matrices using Vandermonde/MDS codes, a scalable grouping strategy with provable optimality, and a communication overhead that grows logarithmically in the number of data samples $p$. The approach extends beyond fractional repetition assignments, enables exact gradient recovery with reduced replication, and provides a foundation for future work on the fundamental limits of general data assignments.

Abstract

We tackle the problem of Byzantine errors in distributed gradient descent within the Byzantine-resilient gradient coding framework. Our proposed solution can recover the exact full gradient in the presence of $s$ malicious workers with a data replication factor of only $s+1$. It generalizes previous solutions to any data assignment scheme that has a regular replication over all data samples. The scheme detects malicious workers through additional interactive communication and a small number of local computations at the main node, leveraging group-wise comparisons between workers with a provably optimal grouping strategy. The scheme requires at most $s$ interactive rounds that incur a total communication cost logarithmic in the number of data samples.
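To give some intuition for how an interactive check can cost only logarithmically many messages in the number of data samples, the following is a minimal sketch of a bisection between two workers whose claimed partial gradients disagree. It is an illustration under assumptions, not the paper's protocol: the function names `locate_disagreement` and `identify_liar` are hypothetical, gradients are modelled as integers instead of vectors, each worker is assumed to answer consistently with one fixed list of claims, and the two lists are assumed to have different totals.

```python
def locate_disagreement(claims_a, claims_b):
    """Bisect to one index where the two claimed gradient lists differ.

    Invariant: the sums over [lo, hi) differ, so at least one entry in
    that range differs. Each loop iteration models one interactive round.
    """
    if sum(claims_a) == sum(claims_b):
        raise ValueError("totals agree; nothing to bisect in this sketch")
    lo, hi = 0, len(claims_a)
    rounds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        rounds += 1
        # Ask both workers for the sum over the left half only.
        if sum(claims_a[lo:mid]) != sum(claims_b[lo:mid]):
            hi = mid  # disagreement is in the left half
        else:
            lo = mid  # left halves agree, so the right halves must differ
    return lo, rounds


def identify_liar(claims_a, claims_b, true_partial_gradient):
    """Use one local computation at the main node to settle the dispute."""
    index, rounds = locate_disagreement(claims_a, claims_b)
    truth = true_partial_gradient(index)  # the single local computation
    liar = "A" if claims_a[index] != truth else "B"
    return liar, index, rounds


# Hypothetical data: 16 samples, worker B corrupts the entry at index 11.
honest = [3 * i + 1 for i in range(16)]
corrupted = list(honest)
corrupted[11] += 5
print(identify_liar(honest, corrupted, lambda i: honest[i]))
```

With 16 samples the bisection needs 4 rounds, and the main node computes a single partial gradient itself. A real protocol must also handle workers that answer inconsistently across rounds, which this sketch deliberately leaves out.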

Paper Structure (15 sections, 9 theorems, 12 equations, 2 figures)

Introduction
Problem Setting
Scheme Construction for Arbitrary Allocations
Construction of the Encoding and Decoding Matrices
Workers' Response and Contradicting Groups
Elimination Tournament to Identify Malicious Workers
Analysis and Discussion
Optimality of the Grouping Strategy
Figures of Merit
Proof of \ref{theorem: schemeoverview}
Proof of \ref{lemma:achievable-group-number}
Pivot Analysis of a System of Linear Equations
Determining the Combining Vector $\mathbf{b}$
Determinant of Cauchy-like Matrix
Proof of \ref{rem: partialgradient}

Key Result

Theorem 1

The scheme constructed below is an $s$-BGC scheme with a parameter $u$, $1 \leq u \leq s + 1$, and requires $c \leq s + 1 - u$ local computations, a replication $\rho = s + u$ and a communication overhead $\kappa \leq (r+2)(s + 1 - \dots$
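The trade-off stated in the theorem can be tabulated directly. The short sketch below evaluates only the two bounds that are fully given above, the replication $\rho = s + u$ and the bound $c \leq s + 1 - u$ on local computations. The communication bound is left out because its expression is cut off in this excerpt. The function name `figures_of_merit` is a hypothetical helper, not something defined in the paper.

```python
def figures_of_merit(s, u):
    """Replication and local-computation bound for s malicious workers.

    Uses only the relations quoted in Theorem 1:
    rho = s + u and c <= s + 1 - u, valid for 1 <= u <= s + 1.
    """
    if not 1 <= u <= s + 1:
        raise ValueError("u must satisfy 1 <= u <= s + 1")
    return {"replication": s + u, "max_local_computations": s + 1 - u}


# Tabulate the whole range of u for s = 3 malicious workers.
s = 3
for u in range(1, s + 2):
    print(u, figures_of_merit(s, u))
```

At one end, $u = 1$ gives the replication $s + 1$ quoted in the abstract with up to $s$ local computations. At the other end, $u = s + 1$ gives replication $2s + 1$ and no local computations at the main node.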

Figures (2)

Figure 1: Gradient Coding: Each worker is assigned two partial gradients to compute and transmits a linear combination to the main node. Without malicious workers, the main node obtains the full gradient from the responses of any two workers. For the considered setting ($W_3$ is malicious), no existing gradient coding scheme can recover the exact gradient correctly. Our goal is to present a scheme that allows the main node to identify the malicious worker and reconstruct the full gradient correctly from the honest workers.
Figure 2: Flowchart illustrating the steps of the interactive protocol.
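The situation in Figure 1 can be reproduced with a small numeric sketch. The encoding coefficients below are an assumption: they are a commonly used three-worker gradient code in which each worker holds two of three partial gradients, and they are not read off the figure. Partial gradients are modelled as single numbers instead of vectors, and the names `ENCODING`, `DECODING`, `encode`, and `decode_all_pairs` are hypothetical.

```python
from fractions import Fraction

# Assumed code: W1 holds (g1, g2), W2 holds (g2, g3), W3 holds (g1, g3).
ENCODING = {
    "W1": (Fraction(1, 2), 1, 0),   # g1/2 + g2
    "W2": (0, 1, -1),               # g2 - g3
    "W3": (Fraction(1, 2), 0, 1),   # g1/2 + g3
}
# Coefficients that combine each pair of responses into g1 + g2 + g3.
DECODING = {
    ("W1", "W2"): (2, -1),
    ("W1", "W3"): (1, 1),
    ("W2", "W3"): (1, 2),
}


def encode(worker, partial_gradients):
    """Linear combination a worker sends to the main node."""
    return sum(c * g for c, g in zip(ENCODING[worker], partial_gradients))


def decode_all_pairs(responses):
    """Full-gradient estimate obtained from every pair of workers."""
    return {
        pair: a * responses[pair[0]] + b * responses[pair[1]]
        for pair, (a, b) in DECODING.items()
    }


partial_gradients = [4, -2, 7]  # hypothetical values; full gradient is 9
honest = {w: encode(w, partial_gradients) for w in ENCODING}
print(decode_all_pairs(honest))      # every pair yields the same value

tampered = dict(honest)
tampered["W3"] += 3                  # W3 behaves maliciously
print(decode_all_pairs(tampered))    # the three pairs now disagree
```

When all workers are honest, every pair decodes to the same full gradient. Once W3 alters its response, the three pairwise estimates disagree, so the main node can tell that something is wrong. In this code, however, any two responses are consistent with some choice of partial gradients, so the responses alone do not reveal which worker lied. Extra interaction or local computation is what resolves this.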

Theorems & Definitions (19)

Definition 1: $s$-BGC scheme \cite{hofmeisterTradingCommunication2023}
Definition 2: Figures of merit \cite{hofmeisterTradingCommunication2023}
Theorem 1
Lemma 1
proof
Remark 1
Lemma 2
proof
Lemma 3
proof
...and 9 more
