High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers
Wenyu Liu, Tianqiang Huang, Pengfei Zhang, Zong Ke, Minghui Min, Puning Zhao
TL;DR
This work tackles Byzantine-robust distributed learning in high dimensions by introducing a direct high-dimensional semi-verified mean estimation method. The approach identifies a large-variance subspace and uses a small auxiliary clean dataset to estimate coordinates within that subspace, while leveraging corrupted gradient vectors for the orthogonal components; this yields minimax-optimal rates and removes the $\sqrt{d}$ scaling common in prior methods. The semi-verified estimator serves as a gradient aggregator, enabling robust distributed optimization under arbitrary numbers of Byzantine attackers; theoretical upper and lower bounds show dimension-free performance, and experiments on synthetic data and MNIST confirm substantial gains at high dimensionality. Collectively, the method provides a scalable, provably robust solution for federated learning settings with many untrusted workers and very large models.
Abstract
Adversarial attacks pose a major challenge to distributed learning systems, prompting the development of numerous robust learning methods. However, most existing approaches suffer from the curse of dimensionality, i.e., their error grows with the number of model parameters. In this paper, we make progress on high-dimensional problems under an arbitrary number of Byzantine attackers. The cornerstone of our design is a direct high-dimensional semi-verified mean estimation method. The idea is to identify a subspace with large variance. The components of the mean perpendicular to this subspace are estimated from the corrupted gradient vectors uploaded by worker machines, while the components within this subspace are estimated from an auxiliary clean dataset. As a result, combining a large corrupted dataset with a small clean dataset yields significantly better performance than using either one alone. We then apply this method as the aggregator in distributed learning. Our theoretical analysis shows that, compared with existing solutions, our method removes the $\sqrt{d}$ dependence on the dimensionality and achieves minimax optimal statistical rates. Numerical results validate our theory and demonstrate the effectiveness of the proposed method.
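To make the subspace idea concrete, here is a minimal NumPy sketch of a semi-verified mean estimate. It is an illustration under simplifying assumptions, not the procedure analyzed in the paper. The function name `semi_verified_mean`, the parameter `var_threshold`, and the single-pass eigenvalue thresholding used to pick the large-variance subspace are all hypothetical choices made for this example. The toy attack (every Byzantine worker uploading the same shifted vector) is likewise invented for the demonstration.

```python
import numpy as np


def semi_verified_mean(corrupted, clean, var_threshold):
    """Combine a large corrupted sample with a small clean sample.

    Hypothetical, simplified sketch: directions in which the corrupted
    sample has variance above `var_threshold` are treated as the
    large-variance subspace and are estimated from the clean data; all
    remaining directions are estimated from the corrupted data.
    """
    corrupted = np.asarray(corrupted, dtype=float)
    clean = np.asarray(clean, dtype=float)

    mu_corrupted = corrupted.mean(axis=0)
    mu_clean = clean.mean(axis=0)

    # Empirical covariance of the corrupted sample.
    centered = corrupted - mu_corrupted
    cov = centered.T @ centered / corrupted.shape[0]

    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, eigvals > var_threshold]  # shape (d, k), k may be 0

    # P @ mu_clean + (I - P) @ mu_corrupted, with P = basis @ basis.T,
    # computed without forming the d x d projector.
    return mu_corrupted + basis @ (basis.T @ (mu_clean - mu_corrupted))


# Toy demonstration with a Byzantine majority (assumed attack model).
rng = np.random.default_rng(0)
d, n_honest, n_byzantine, n_clean = 50, 400, 600, 50
true_mean = np.zeros(d)

honest = rng.normal(size=(n_honest, d)) + true_mean
attack = np.zeros((n_byzantine, d))
attack[:, 0] = 10.0  # every attacker uploads the same shifted vector
corrupted = np.vstack([honest, attack])
clean = rng.normal(size=(n_clean, d)) + true_mean

naive = corrupted.mean(axis=0)
estimate = semi_verified_mean(corrupted, clean, var_threshold=3.0)

print("error of plain averaging:", np.linalg.norm(naive - true_mean))
print("error of semi-verified sketch:", np.linalg.norm(estimate - true_mean))
```

In this toy setting the attack pulls the plain average far from the true mean, but it also inflates the variance along the attacked direction, so that direction is handed to the clean sample. The clean sample alone would be noisy in all `d` coordinates; here it is only used for the few flagged directions.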
