Byzantine Robustness and Partial Participation Can Be Achieved at Once: Just Clip Gradient Differences

Grigory Malinovsky; Peter Richtárik; Samuel Horváth; Eduard Gorbunov

TL;DR

This work introduces Byz-VR-MARINA-PP, the first distributed algorithm that achieves Byzantine robustness under partial client participation by coupling gradient clipping with recursive variance reduction and unbiased compression. The method extends the existing Byz-VR-MARINA framework to settings where only a subset of clients participates in each round, while clipping bounds the influence of Byzantine workers even when they form a local majority. The authors provide convergence guarantees for general non-convex objectives and linear convergence under Polyak–Łojasiewicz conditions, with results matching state-of-the-art when full participation is assumed. A practical heuristic shows how clipping can extend any Byzantine-robust method to partial participation, and extensive experiments confirm the benefits of clipping in Byzantine and partial-participation scenarios, suggesting meaningful improvements for scalable, robust distributed learning.

Abstract

Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clipping to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results. We also propose a heuristic on adjusting any Byzantine-robust method to a partial participation scenario via clipping.
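The bounding effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `clip` and `clipped_mean_update` are hypothetical, a plain mean stands in for a robust aggregator, and the clipping level `lam` is chosen arbitrarily. The sketch only shows that once every received difference is clipped to norm at most `lam`, the estimator cannot move by more than `lam` in a round, even if all sampled clients are Byzantine.

```python
import numpy as np

def clip(v, lam):
    """Rescale v so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:  # already inside the ball (also covers the zero vector)
        return v.copy()
    return v * (lam / norm)

def clipped_mean_update(g_prev, diffs, lam):
    """Add the mean of clipped differences to the previous estimator.

    Assumption: a plain mean replaces the robust aggregator for brevity.
    """
    clipped = [clip(d, lam) for d in diffs]
    return g_prev + np.mean(clipped, axis=0)

# Hypothetical round in which every sampled client is Byzantine
# and sends an arbitrarily large "gradient difference".
g_prev = np.zeros(3)
byzantine_diffs = [np.array([1e6, -1e6, 1e6]), np.array([-5e8, 0.0, 0.0])]
g_next = clipped_mean_update(g_prev, byzantine_diffs, lam=0.1)

# The estimator moved by at most lam, regardless of what was sent.
shift = np.linalg.norm(g_next - g_prev)
```

The bound on `shift` follows from the triangle inequality: a mean of vectors with norm at most `lam` has norm at most `lam`.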

Paper Structure (34 sections, 18 theorems, 157 equations, 4 figures, 1 algorithm)

Introduction
Our Contributions
Related Work
Preliminaries
New Method: Byz-VR-MARINA-PP
Convergence Results
Heuristic extension of Byz-VR-MARINA-PP.
Numerical Experiments
Conclusion and Future Work
Extra Related Work
Further Comparison with data2021byzantine.
Byzantine robustness.
Variance reduction.
Partial participation and client sampling.
Communication compression.
...and 19 more sections

Key Result

Theorem 4.1

Let Assumptions assm:bounded-aggr, assm:het_simplified, and assm:smoothness_simplified hold and let $\lambda_{k+1} = 2{\cal L} \left\|x^{k+1} - x^k\right\|$. Assume that $0<\gamma \leq 1/{\cal L}(1+\sqrt{A}),$ where the constant $A$ is defined as [...]. Then for all $K \geq 0$ the iterates produced by Byz-VR-MARINA-PP (Algorithm alg:byz_vr_marina) satisfy [...], where $\widehat{D} = \frac{ 2\delta\mathcal{P}_{\mathcal{G
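The clipping level in the theorem, $\lambda_{k+1} = 2{\cal L}\|x^{k+1} - x^k\|$, scales with the distance between consecutive iterates, so it shrinks as the method stops moving. The sketch below computes it for one step; the function name `clipping_level`, the numeric values, and the plain update $x^{k+1} = x^k - \gamma g^k$ are assumptions made for illustration only.

```python
import numpy as np

def clipping_level(x_next, x_curr, smoothness):
    """Return 2 * smoothness * ||x_next - x_curr|| (the lambda_{k+1} rule)."""
    return 2.0 * smoothness * np.linalg.norm(x_next - x_curr)

# Hypothetical values: one step x_next = x_curr - gamma * g.
x_curr = np.array([1.0, 2.0])
g = np.array([3.0, 4.0])   # ||g|| = 5
gamma = 0.1
x_next = x_curr - gamma * g

# ||x_next - x_curr|| = gamma * ||g|| = 0.5, so lam is 2 * 2.0 * 0.5 = 2.0
# up to floating-point rounding.
lam = clipping_level(x_next, x_curr, smoothness=2.0)
```

Under this update rule the level equals $2{\cal L}\gamma\|g^k\|$, so a smaller step size or a smaller estimator norm tightens the clipping radius.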

Figures (4)

Figure 1: The optimality gap $f(x^k) - f(x^*)$ in 3 different scenarios. We use coordinate-wise mean with bucket size 2 as the aggregation rule and shift-back as the attack. We use the a9a dataset with 15 good and 5 Byzantine workers, where each worker has access to the full dataset. We do not use any compression. In each round, we sample 20% of the clients uniformly at random to participate, unless full participation is explicitly stated. Left: Linear convergence of Byz-VR-MARINA-PP with clipping versus non-convergence without clipping. Middle: Full versus partial participation, showing faster convergence with clipping. Right: Sensitivity to the clipping multiplier $\lambda$, demonstrating consistent linear convergence across varying $\lambda$ values.
Figure 2: Training loss of 2 aggregation rules (CM, RFA) under 2 attacks (BF, SHB) on the MNIST dataset under heterogeneous data split with 20 clients, 5 of which are malicious. Complete experiments with 4 attacks (BF, LF, ALIE, SHB) and test accuracy are provided in the appendix.
Figure 3: Training loss of 2 aggregation rules (CM, RFA) under 4 attacks (BF, LF, ALIE, SHB) on the MNIST dataset under heterogeneous data split with 20 clients, 5 of which are malicious.
Figure 4: Testing accuracy of 2 aggregation rules (CM, RFA) under 4 attacks (BF, LF, ALIE, SHB) on the MNIST dataset under heterogeneous data split with 20 clients, 5 of which are malicious.
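The captions refer to aggregation with bucketing. A rough sketch of that idea follows, under stated assumptions: vectors are randomly permuted, averaged in buckets of a fixed size, and the bucket averages are passed to an aggregation rule. Coordinate-wise median is used here as the rule for illustration, whereas the Figure 1 caption names coordinate-wise mean. The helper names and the toy data (16 good and 4 Byzantine vectors, which differs from the 15/5 split in the experiments) are hypothetical.

```python
import numpy as np

def bucketing(vectors, bucket_size, rng):
    """Randomly permute the vectors and average them in consecutive buckets."""
    vectors = np.asarray(vectors, dtype=float)
    shuffled = vectors[rng.permutation(len(vectors))]
    return np.array([shuffled[i:i + bucket_size].mean(axis=0)
                     for i in range(0, len(shuffled), bucket_size)])

def coordinate_wise_median(vectors):
    """Median taken independently in every coordinate."""
    return np.median(np.asarray(vectors, dtype=float), axis=0)

rng = np.random.default_rng(0)
# Toy data: good vectors cluster near (1, 1); Byzantine vectors are far away.
good = [np.array([1.0, 1.0]) + 0.01 * rng.standard_normal(2) for _ in range(16)]
bad = [np.array([100.0, -100.0]) for _ in range(4)]

buckets = bucketing(good + bad, bucket_size=2, rng=rng)  # 10 bucket averages
aggregate = coordinate_wise_median(buckets)              # stays near (1, 1)
```

Bucketing with size 2 can at most double the fraction of contaminated inputs, so in this toy setup at least 6 of the 10 buckets contain only good vectors and the median remains close to the good cluster.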

Theorems & Definitions (34)

Definition 2.1: $(\delta, c)$-Robust Aggregator
Definition 2.2: Unbiased compression
Theorem 4.1
Theorem 4.2
Lemma B.1
Lemma D.6
proof
Lemma D.7: Lemma 2 from li2021page
Lemma D.8
proof
...and 24 more
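As a concrete instance of the unbiased compression named in Definition 2.2, the sketch below implements the standard Rand-K sparsifier: it keeps k coordinates chosen uniformly at random and rescales them by d/k, which gives $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x) - x\|^2 = (d/k - 1)\|x\|^2$. The choice of Rand-K, the function name `rand_k`, and the test vector are assumptions for illustration; the source does not state which compressor the method is run with.

```python
import numpy as np

def rand_k(x, k, rng):
    """Rand-K sparsifier: keep k random coordinates, scaled by d / k."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)  # k distinct coordinates
    out = np.zeros_like(x, dtype=float)
    out[idx] = x[idx] * (d / k)                 # rescaling makes it unbiased
    return out

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0, 0.5])

# Averaging many independent compressions should approximately recover x.
samples = np.array([rand_k(x, k=2, rng=rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)
```

Each compressed vector carries only k nonzero entries, so communication drops by a factor of about d/k at the price of added variance.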
