Adaptive Gradient Clipping for Robust Federated Learning

Youssef Allouah; Rachid Guerraoui; Nirupam Gupta; Ahmed Jellouli; Geovani Rizk; John Stephan

TL;DR

The paper addresses the robustness of distributed and federated learning to Byzantine workers. It shows that static gradient clipping is brittle: its effect varies with the level of data heterogeneity and with the attack. It introduces Adaptive Robust Clipping (ARC), which sets the clipping threshold dynamically from the input gradients while preserving the theoretical guarantees of Robust-DGD. The authors prove that ARC preserves $(f,\kappa)$-robustness up to an additive term and that, when the model is well-initialized, ARC improves asymptotic convergence. Experiments on MNIST, Fashion-MNIST, and CIFAR-10 show significant robustness gains in highly heterogeneous and adversarial settings. The work highlights a gap between worst-case theory and practical performance, and positions ARC as a tuning-free component for robust distributed learning.

Abstract

Robust federated learning aims to maintain reliable performance despite the presence of adversarial or misbehaving workers. While state-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods were proven theoretically optimal, their empirical success has often relied on pre-aggregation gradient clipping. However, existing static clipping strategies yield inconsistent results: enhancing robustness against some attacks while being ineffective or even detrimental against others. To address this limitation, we propose a principled adaptive clipping strategy, Adaptive Robust Clipping (ARC), which dynamically adjusts clipping thresholds based on the input gradients. We prove that ARC not only preserves the theoretical robustness guarantees of SOTA Robust-DGD methods but also provably improves asymptotic convergence when the model is well-initialized. Extensive experiments on benchmark image classification tasks confirm these theoretical insights, demonstrating that ARC significantly enhances robustness, particularly in highly heterogeneous and adversarial settings.
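To make the contrast between static and adaptive clipping concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: the abstract only says that ARC derives its threshold from the input gradients, so the rule used below (clip to the (k+1)-th largest input norm, with a caller-supplied `k`) and all function names are illustrative assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clip(vec, threshold):
    """Rescale vec so that its L2 norm is at most threshold."""
    norm = l2_norm(vec)
    if norm <= threshold or norm == 0.0:
        return list(vec)
    scale = threshold / norm
    return [x * scale for x in vec]


def static_clip(grads, threshold):
    """Static clipping: one fixed threshold, chosen before seeing the inputs."""
    return [clip(g, threshold) for g in grads]


def adaptive_clip(grads, k):
    """Adaptive clipping (illustrative assumption, not the paper's exact rule).

    The threshold is the (k+1)-th largest input norm, so at most the k
    largest-norm vectors are scaled down and the rest pass through unchanged.
    """
    if not 0 <= k < len(grads):
        raise ValueError("k must satisfy 0 <= k < len(grads)")
    norms = sorted((l2_norm(g) for g in grads), reverse=True)
    threshold = norms[k]
    return [clip(g, threshold) for g in grads]


# Hypothetical inputs: two moderate gradients and one with a very large norm.
grads = [[3.0, 4.0], [0.3, 0.4], [30.0, 40.0]]
static_out = static_clip(grads, threshold=1.0)
adaptive_out = adaptive_clip(grads, k=1)
```

With the fixed threshold of 1.0, the first gradient is shrunk even though it is not an outlier. With the input-dependent threshold, only the largest-norm vector is rescaled, and the threshold follows the scale of the inputs without manual tuning.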

Paper Structure (44 sections, 20 theorems, 101 equations, 26 figures, 7 tables, 3 algorithms)

Introduction
Problem Statement and Relevant Background
Adaptive Robust Clipping (ARC) and its Properties
Empirical Evaluation
Brittleness of Static Clipping and Superiority of ARC
Performance Gains of ARC Over Robust-DSGD
Improved Guarantee of Robust-DGD with ARC
Improvement of convergence guarantees
Practical scope of Theorem \ref{thm:improv_ARC}.
Influence of model initialization on empirical robustness
Related Work
Conclusion & Discussion
Acknowledgment.
Additional Details on ARC
Design of ARC
...and 29 more sections

Key Result

Lemma 2.3

Under $(G, B)$-gradient dissimilarity, a distributed learning algorithm is $(f, \varepsilon)$-resilient only if $\frac{f}{n} < \frac{1}{2+B^2}$ and $\varepsilon \geq \frac{1}{4} \cdot \frac{f}{n-(2+B^2)f} G^2$.
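The two conditions in Lemma 2.3 are easy to evaluate numerically. The sketch below is a small helper (the function name and the example values are hypothetical) that checks the breakdown condition $\frac{f}{n} < \frac{1}{2+B^2}$ and, when it holds, returns the lower bound on $\varepsilon$.

```python
def resilience_lower_bound(n, f, G, B):
    """Evaluate the necessary conditions of Lemma 2.3.

    Returns the lower bound (1/4) * f / (n - (2 + B^2) f) * G^2 on epsilon,
    or None when f/n >= 1/(2 + B^2), in which case the lemma rules out
    (f, epsilon)-resilience for every epsilon.
    """
    if n <= 0 or f < 0 or f >= n:
        raise ValueError("need n > 0 and 0 <= f < n")
    # f/n < 1/(2 + B^2) is equivalent to (2 + B^2) * f < n for n > 0.
    if (2 + B ** 2) * f >= n:
        return None
    return 0.25 * f / (n - (2 + B ** 2) * f) * G ** 2


# Hypothetical values: 11 workers, 1 adversarial, G = 1, B = 0.
bound = resilience_lower_bound(n=11, f=1, G=1.0, B=0.0)

# With B = 1 the tolerable fraction drops below 1/3, so f = 5 of 11 is too many.
too_many = resilience_lower_bound(n=11, f=5, G=1.0, B=1.0)
```

For the first call the bound is $\frac{1}{4} \cdot \frac{1}{11 - 2} = \frac{1}{36}$. The second call returns `None` because $(2 + B^2) f = 15 \geq 11$.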

Figures (26)

Figure 1: Worst-case maximal accuracies of Robust-DSGD, with and without ARC, across several types of misbehavior for distributed MNIST with $10$ honest workers and under extreme heterogeneity. On the left, we vary the number of adversarial workers. On the right, we vary the initialization conditions by scaling a well-chosen set of initial parameters (CWTM $\circ$ NNM is used, and $f=1$). More details on the experimental setup can be found in Sections \ref{sec_experiments} and \ref{sec:model_initialization_impact}, and Appendix \ref{app_exp_setup}.
Figure 2: Impact of the clipping strategy, under varying heterogeneity levels, on the worst-case maximal accuracy achieved by Robust-DSGD on MNIST with $n = 15$ workers. CWTM $\circ$ NNM is used as aggregation. DSGD reaches at least 98.5% in accuracy in all heterogeneity regimes.
Figure 3: Performance of Robust-DSGD on MNIST with ARC and without clipping. There are 10 honest workers and $f = 1$ adversarial worker executing the FOE attack [xie2020fall].
Figure 4: Worst-case maximal accuracies achieved by Robust-DSGD, with and without ARC, on heterogeneously-distributed MNIST with $10$ honest workers. Left: $f = 1$ adversarial worker among $n=11$ for varying levels of heterogeneity. Right: $\alpha = 0.5$ for varying $f$.
Figure 5: Worst-case maximal accuracies achieved by Robust-DSGD with CWTM, on MNIST ($\alpha = 0.1$), with 10 honest workers and 1 Byzantine worker. The x-axis represents worsening model initialization.
...and 21 more figures

Theorems & Definitions (35)

Definition 2.1
Definition 2.2
Lemma 2.3: Non-convex extension of Theorem 1 in [allouah2023robust]
Definition 2.4
Lemma 2.5: Lemma 1 in [allouah2023fixing]
Lemma 2.6: Theorem 2 in [allouah2023robust]
Lemma 3.0
Theorem 3.1
Lemma 5.0: Bounded Output
Theorem 5.1
...and 25 more
