Byzantine-Robust Federated Learning over Ring-All-Reduce Distributed Computing

Minghong Fang, Zhuqing Liu, Xuecen Zhao, Jia Liu

TL;DR

This work addresses the scalability and security challenges of federated learning by coupling Byzantine robustness with ring-all-reduce architectures. BRACE introduces 1-bit gradient quantization, neighbor sub-vector exchange, and a dimension-wise consensus threshold to mitigate poisoning while preserving bandwidth efficiency, achieving an $O(1/T)$ convergence rate under Byzantine attacks. The method demonstrates robustness and reduced communication costs on Fashion-MNIST and CIFAR-10 with large client counts, outperforming server-based and standard RAR defenses. The results suggest BRACE enables scalable, secure, serverless FL suitable for large-scale distributed deployments with heterogeneous data.

Abstract

Federated learning (FL) has gained attention as a distributed learning paradigm for its data privacy benefits and accelerated convergence through parallel computation. Traditional FL relies on a server-client (SC) architecture, where a central server coordinates multiple clients to train a global model, but this approach faces scalability challenges due to server communication bottlenecks. To overcome this, the ring-all-reduce (RAR) architecture has been introduced, eliminating the central server and achieving bandwidth optimality. However, the tightly coupled nature of RAR's ring topology exposes it to unique Byzantine attack risks not present in SC-based FL. Despite its potential, designing Byzantine-robust RAR-based FL algorithms remains an open problem. To address this gap, we propose BRACE (Byzantine-robust ring-all-reduce), the first RAR-based FL algorithm to achieve both Byzantine robustness and communication efficiency. We provide theoretical guarantees for the convergence of BRACE under Byzantine attacks, demonstrate its bandwidth efficiency, and validate its practical effectiveness through experiments. Our work offers a foundational understanding of Byzantine-robust RAR-based FL design.
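The abstract describes ring-all-reduce only at a high level, so a minimal single-process simulation of the standard (non-robust) procedure may help. It is a sketch of generic RAR summation, not of BRACE itself. The function name `ring_all_reduce` and the use of NumPy arrays to stand in for per-client gradient vectors are assumptions made for illustration.

```python
import numpy as np

def ring_all_reduce(vectors):
    """Simulate sum ring-all-reduce over n workers inside one process.

    `vectors` is a list of n equal-length 1-D arrays, one per worker.
    Returns a list of n arrays, each equal to the element-wise sum.
    """
    n = len(vectors)
    # Every worker splits its vector into n sub-vectors (chunks).
    chunks = [np.array_split(np.asarray(v, dtype=float), n) for v in vectors]

    # Phase 1, reduce-scatter: at step s, worker i sends chunk (i - s) mod n
    # to its right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sent = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            dst, idx = (i + 1) % n, (i - s) % n
            chunks[dst][idx] = chunks[dst][idx] + sent[i]
    # Now worker i holds the fully summed chunk (i + 1) mod n.

    # Phase 2, all-gather: at step s, worker i forwards chunk (i + 1 - s) mod n
    # to its right neighbor, which overwrites its stale copy.
    for s in range(n - 1):
        sent = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            dst, idx = (i + 1) % n, (i + 1 - s) % n
            chunks[dst][idx] = sent[i]

    return [np.concatenate(c) for c in chunks]


if __name__ == "__main__":
    grads = [np.arange(10, dtype=float) * (i + 1) for i in range(4)]
    reduced = ring_all_reduce(grads)
    print(np.allclose(reduced[0], np.sum(grads, axis=0)))
```

In this scheme each worker only ever talks to its ring neighbor and forwards one chunk per step, which is where the bandwidth efficiency comes from. It also illustrates the coupling the abstract warns about: every partial sum passes through every worker, so a single worker that alters a chunk in transit corrupts the result that all others receive.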

Paper Structure

This paper contains 11 sections, 1 theorem, 11 equations, 4 figures, 2 tables.

Key Result

Theorem 1

If $0 \leq \Pr\left( I_{\left(\sum_{i \in [n]} \text{sign}\left(\bm{g}_i^t[k]\right) > \lambda\right)} = 0 \mid \mathcal{F}_t \right) < 0.5$ for $i \in [n]$ and $k \in [d]$, where $\mathcal{F}_t$ represents the filtration of all random variables at round $t$, and $I$ is an indicator function, then un… where $T$ is the total training rounds, $\bm{w}^1$ is the initial model, and $f^*$ is the minimum o…
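The quantity inside the condition of Theorem 1 can be computed directly. This is a minimal sketch, assuming $n$ client gradients of dimension $d$ held as NumPy arrays. The helper names `sign_quantize` and `consensus_indicator` are hypothetical, and mapping a zero coordinate to $+1$ is an assumption. How BRACE turns the indicator into a model update is not stated in this excerpt and is not shown.

```python
import numpy as np

def sign_quantize(grad):
    """1-bit quantization: keep only the sign of each coordinate.

    Assumption: a coordinate equal to zero is mapped to +1.
    """
    return np.where(np.asarray(grad) >= 0, 1, -1)

def consensus_indicator(grads, lam):
    """Per-dimension indicator I(sum_i sign(g_i[k]) > lam) from Theorem 1.

    `grads` is a sequence of n gradient vectors of length d;
    `lam` is the threshold lambda. Returns an int array of length d.
    """
    signs = np.stack([sign_quantize(g) for g in grads])  # shape (n, d)
    return (signs.sum(axis=0) > lam).astype(int)


if __name__ == "__main__":
    grads = [
        [0.5, -1.0, 2.0],
        [0.1, -0.3, -0.7],
        [1.2, -2.0, 0.4],
        [-9.0, 9.0, 9.0],  # large-magnitude outlier
    ]
    print(consensus_indicator(grads, lam=0))
```

With `lam=0` the per-dimension sign sums are 2, -2, and 2, so the indicators are 1, 0, and 1. Because only signs are summed, the fourth client's large magnitudes carry no more weight than any other client's vote.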

Figures (4)

  • Figure 1: Illustration of the ring-all-reduce (RAR) process.
  • Figure 2: An example of the BRACE algorithm.
  • Figure 3: Impact of malicious client fraction, Non-IID degree, and total clients using the Fashion-MNIST dataset.
  • Figure 4: Impact of $\lambda$ using the Fashion-MNIST dataset.

Theorems & Definitions (2)

  • Theorem 1
  • Remark