Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients
Shiyuan Zuo, Xingrun Yan, Rongfei Fan, Li Shen, Puning Zhao, Jie Xu, Han Hu
TL;DR
This work addresses robustness in Federated Learning under Byzantine attacks and non-IID data by introducing Fed-NGA, a lightweight aggregation rule that normalizes client gradients before weighted averaging. The method has an aggregation time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients, and is proven to converge for non-convex losses to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T^{\frac{1}{2}-\delta})$, where $T$ is the number of iterations and $\delta \in (0, 1/2)$, with conditions under which zero optimality gap is attainable. Theoretical results cover two variants of assumptions on gradient noise and data heterogeneity, and extensive experiments across MNIST, CIFAR10, and TinyImageNet demonstrate Fed-NGA’s robustness to multiple Byzantine attacks and substantial time-efficiency gains over baselines. The findings suggest Fed-NGA as a scalable, Byzantine-robust solution for non-IID FL, with practical impact for large-scale distributed learning systems.
Abstract
Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but is vulnerable to Byzantine attacks and data heterogeneity, which can severely degrade performance. Existing Byzantine-robust approaches tackle data heterogeneity, but incur high computational overhead during gradient aggregation, thereby slowing down the training process. To address this issue, we propose a simple yet effective Federated Normalized Gradients Algorithm (Fed-NGA), which performs aggregation by merely computing the weighted mean of the normalized gradients from each client. This approach yields a favorable time complexity of $\mathcal{O}(pM)$, where $p$ is the model dimension and $M$ is the number of clients. We rigorously prove that Fed-NGA is robust to both Byzantine faults and data heterogeneity. For non-convex loss functions, Fed-NGA achieves convergence to a neighborhood of stationary points under general assumptions, and further attains zero optimality gap under some mild conditions, which is an outcome rarely achieved in existing literature. In both cases, the convergence rate is $\mathcal{O}(1/T^{\frac{1}{2} - \delta})$, where $T$ denotes the number of iterations and $\delta \in (0, 1/2)$. Experimental results on benchmark datasets confirm the superior time efficiency and convergence performance of Fed-NGA over existing methods.
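
The aggregation step described in the abstract, a weighted mean of normalized client gradients, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `normalize` and `aggregate_normalized`, the use of the L2 norm, the example weights, and the choice to map a near-zero gradient to the zero vector are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize(grad: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a gradient to unit L2 norm.

    Assumption: a gradient whose norm is below eps is mapped to the
    zero vector instead of being divided by (almost) zero.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm < eps:
        return [0.0] * len(grad)
    return [g / norm for g in grad]


def aggregate_normalized(
    grads: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted mean of normalized gradients.

    One pass over M gradients of dimension p, so the cost is O(pM).
    The weights are assumed to be non-negative and to sum to 1.
    """
    if len(grads) != len(weights):
        raise ValueError("need exactly one weight per client gradient")
    p = len(grads[0])
    agg = [0.0] * p
    for grad, w in zip(grads, weights):
        unit = normalize(grad)
        for j in range(p):
            agg[j] += w * unit[j]
    return agg


# Hypothetical example: two honest clients and one client that sends
# a gradient with a very large magnitude.
client_grads = [[3.0, 4.0], [0.0, 2.0], [-1e6, 0.0]]
client_weights = [1 / 3, 1 / 3, 1 / 3]
update = aggregate_normalized(client_grads, client_weights)
```

Because every gradient is rescaled to unit norm before averaging, the large-magnitude gradient in the example contributes no more to `update` than any other client with the same weight, and the norm of the aggregate never exceeds 1 when the weights sum to 1. The sketch covers only the aggregation step; the full training loop and the step-size schedule behind the stated convergence rate are not shown here.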
