Asynchronous Byzantine Federated Learning
Bart Cox, Abele Mălan, Lydia Y. Chen, Jérémie Decouchant
TL;DR
Catalyst addresses the challenge of robust asynchronous Federated Learning in the presence of Byzantine clients by applying clustering-based filtering to the earliest updates and then incorporating late updates to preserve liveness. It operates without requiring a server-held auxiliary dataset and guarantees progress with at least $2f+1$ participating clients, where $f$ is the maximum number of Byzantine clients, leveraging both fast and slow clients through an update aggregation that accounts for staleness. Empirical results across MNIST, CIFAR-10, and WikiText-2 show that Catalyst converges faster and achieves higher accuracy under gradient perturbation, gradient inversion, and backdoor attacks than state-of-the-art baselines such as FedAsync, Kardam, and BASGD. The approach demonstrates strong resilience to Byzantine behavior, maintains competitive benign performance, and scales well with increasing client counts and varying Byzantine fractions, suggesting practical applicability for real-world asynchronous FL deployments.
Abstract
Federated learning (FL) enables a set of geographically distributed clients to collectively train a model through a server. Classically, the training process is synchronous, but it can be made asynchronous to maintain its speed in the presence of slow clients and in heterogeneous networks. The vast majority of Byzantine fault-tolerant FL systems, however, rely on a synchronous training process. Our solution is one of the first Byzantine-resilient and asynchronous FL algorithms that does not require an auxiliary server dataset and is not delayed by stragglers, which are shortcomings of previous works. Intuitively, the server in our solution waits to receive a minimum number of updates from clients on its latest model to safely update it, and is later able to safely leverage the updates that late clients might send. We compare the performance of our solution with state-of-the-art algorithms on both image and text datasets under gradient inversion, perturbation, and backdoor attacks. Our results indicate that our solution trains a model faster than previous synchronous FL solutions and, in the presence of Byzantine clients, maintains a higher accuracy than previous asynchronous FL solutions: up to 1.54x under perturbation attacks and up to 1.75x under gradient inversion attacks.
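The intuition above (wait for a minimum number of updates, filter them, then fold in late updates) can be illustrated with a minimal sketch. It is not the paper's algorithm: the distance-to-median filter is a simple stand-in for the clustering-based filtering, the polynomial staleness decay and its `alpha` parameter are assumptions, and all function names are hypothetical. The sketch also applies a late update without re-validating it, which a real system would need to do to use it safely.

```python
from statistics import median


def quorum_size(f):
    """Updates the server waits for, assuming at most f Byzantine clients."""
    return 2 * f + 1


def filter_updates(updates, f):
    """Keep the len(updates) - f updates closest to the coordinate-wise median.

    Stand-in for a clustering-based filter (assumption, not the paper's method).
    """
    dim = len(updates[0])
    center = [median(u[i] for u in updates) for i in range(dim)]

    def squared_distance(u):
        return sum((a - b) ** 2 for a, b in zip(u, center))

    return sorted(updates, key=squared_distance)[: len(updates) - f]


def mean(updates):
    """Coordinate-wise average of a list of equally sized vectors."""
    dim = len(updates[0])
    return [sum(u[i] for u in updates) / len(updates) for i in range(dim)]


def staleness_weight(staleness, alpha=0.5):
    """Hypothetical polynomial decay: staler updates receive a smaller weight."""
    return (1 + staleness) ** (-alpha)


def apply_update(model, update, lr=1.0, weight=1.0):
    """Gradient-descent style step, scaled by a staleness weight."""
    return [m - lr * weight * g for m, g in zip(model, update)]


# One server round with f = 1: two honest updates and one Byzantine outlier.
f = 1
model = [0.0, 0.0]
early = [[1.0, 1.0], [1.2, 0.8], [100.0, -100.0]]
if len(early) >= quorum_size(f):
    model = apply_update(model, mean(filter_updates(early, f)))

# A straggler's update, computed on a model one version old, arrives later.
late_update, staleness = [0.9, 1.1], 1
model = apply_update(model, late_update, weight=staleness_weight(staleness))
```

With three early updates and `f = 1`, the filter discards the outlier and averages the remaining two; the straggler's update still contributes afterwards, but with a reduced weight.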
