Byzantine Robustness and Partial Participation Can Be Achieved at Once: Just Clip Gradient Differences
Grigory Malinovsky, Peter Richtárik, Samuel Horváth, Eduard Gorbunov
TL;DR
This work introduces Byz-VR-MARINA-PP, the first distributed algorithm with provable Byzantine robustness under partial client participation. It combines gradient clipping with recursive variance reduction and unbiased compression. The method extends the Byz-VR-MARINA framework to settings where only a subset of clients participates in each round. Clipping the stochastic gradient differences bounds the influence of Byzantine workers, even in rounds where the sampled clients are mostly or entirely Byzantine. The authors prove convergence guarantees for general non-convex objectives and linear convergence under the Polyak–Łojasiewicz condition, with rates that match the existing state-of-the-art theoretical results. They also propose a heuristic that uses clipping to adapt any Byzantine-robust method to partial participation, and report experiments showing the benefit of clipping when Byzantine workers and partial participation occur together.
Abstract
Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clipping to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results. We also propose a heuristic on adjusting any Byzantine-robust method to a partial participation scenario via clipping.
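The clipping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' Byz-VR-MARINA-PP algorithm. It assumes a recursive estimator of the form g_new = g_prev + aggregate(clipped differences), and uses a plain mean as a stand-in for a robust aggregator. The names `clip`, `server_step`, and `radius` are hypothetical. The sketch shows that when every sampled client is Byzantine and sends an arbitrarily large vector, the estimator still moves by at most the clipping radius in one round.

```python
import math
import random


def clip(vec, radius):
    """Rescale vec so that its Euclidean norm is at most radius."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= radius:
        return list(vec)
    scale = radius / norm
    return [x * scale for x in vec]


def server_step(g_prev, diffs, radius):
    """One recursive update: g_new = g_prev + mean of clipped differences.

    The plain mean is an illustrative assumption; a Byzantine-robust
    aggregator would normally take its place.
    """
    clipped = [clip(d, radius) for d in diffs]
    n = len(clipped)
    mean = [sum(col) / n for col in zip(*clipped)]
    return [g + m for g, m in zip(g_prev, mean)]


random.seed(0)
dim = 5
g_prev = [0.0] * dim

# Worst case: all four sampled clients are Byzantine and send huge vectors.
byzantine_diffs = [
    [1e6 * random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)
]

g_new = server_step(g_prev, byzantine_diffs, radius=1.0)
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_new, g_prev)))

# By the triangle inequality, the mean of vectors with norm <= radius
# also has norm <= radius, so the shift stays bounded.
print("shift within radius:", shift <= 1.0 + 1e-9)
```

Without clipping, the same round would move the estimator by a distance on the order of 1e6, which is why bounding the differences, rather than relying on the aggregator alone, matters when honest clients may be absent from a round.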
