Tracing Back the Malicious Clients in Poisoning Attacks to Federated Learning
Yuqi Jia, Minghong Fang, Hongbin Liu, Jinghuai Zhang, Neil Zhenqiang Gong
TL;DR
The paper addresses the vulnerability of federated learning (FL) to targeted poisoning attacks, in which the global model misclassifies an attacker-chosen target input. It introduces FLForensics, a post-deployment poison-forensics method that traces malicious clients by computing per-client influence scores across stored checkpoints and then clustering the two-dimensional scores (s_i, s_i′) with HDBSCAN, using a non-target input to help separate benign from malicious clients. The authors provide theoretical guarantees under a formal definition of poisoning attacks and show, across five datasets and multiple attack types, that FLForensics can accurately identify malicious clients even when training-phase defenses fail and client data are non-IID. They also show robustness to adaptive attacks and discuss practical recovery steps after detection, including integration with other defense strategies and extension to centralized learning. The work offers a practical post-deployment tool for improving accountability and resilience in FL systems.
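To make the score computation in the summary above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes that each stored checkpoint holds one update vector per client, and that a client's influence score is the sum over checkpoints of the dot product between that update and the gradient of the loss at a given input. The function names, the data layout, the sign convention, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def influence_scores(
    checkpoints: List[Dict[str, Vector]],
    input_grads: List[Vector],
) -> Dict[str, float]:
    """Accumulate one score per client across stored checkpoints.

    Assumption (not taken from the source): the per-checkpoint
    contribution is dot(gradient of the loss at the input, client update).
    """
    scores: Dict[str, float] = {}
    for updates, grad in zip(checkpoints, input_grads):
        for client_id, update in updates.items():
            scores[client_id] = scores.get(client_id, 0.0) + dot(grad, update)
    return scores


def score_pairs(
    checkpoints: List[Dict[str, Vector]],
    target_grads: List[Vector],
    non_target_grads: List[Vector],
) -> Dict[str, Tuple[float, float]]:
    """Build the two-dimensional score (s_i, s_i') for every client."""
    s = influence_scores(checkpoints, target_grads)
    s_prime = influence_scores(checkpoints, non_target_grads)
    return {client_id: (s[client_id], s_prime[client_id]) for client_id in s}


# Toy data (hypothetical): two checkpoints, three clients, 2-D updates.
checkpoints = [
    {"c0": [1.0, 0.0], "c1": [0.0, 1.0], "c2": [2.0, 2.0]},
    {"c0": [0.5, 0.5], "c1": [0.0, 1.0], "c2": [1.0, 3.0]},
]
target_grads = [[1.0, 1.0], [0.0, 2.0]]      # gradients at the target input
non_target_grads = [[1.0, 0.0], [1.0, 0.0]]  # gradients at a non-target input

pairs = score_pairs(checkpoints, target_grads, non_target_grads)
for client_id, (s_i, s_i_prime) in sorted(pairs.items()):
    print(client_id, s_i, s_i_prime)
```

In a full pipeline, the resulting (s_i, s_i′) pairs would be handed to a density-based clustering step such as HDBSCAN, as the summary describes; that step is left out here to keep the sketch dependency-free.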
Abstract
Poisoning attacks compromise the training phase of federated learning (FL) such that the learned global model misclassifies attacker-chosen inputs called target inputs. Existing defenses mainly focus on protecting the training phase of FL such that the learned global model is poison-free. However, these defenses often achieve limited effectiveness when the clients' local training data is highly non-IID or the number of malicious clients is large, as confirmed in our experiments. In this work, we propose FLForensics, the first poison-forensics method for FL. FLForensics complements existing training-phase defenses. In particular, when training-phase defenses fail and a poisoned global model is deployed, FLForensics aims to trace back the malicious clients that performed the poisoning attack after a misclassified target input is identified. We theoretically show that FLForensics can accurately distinguish between benign and malicious clients under a formal definition of poisoning attack. Moreover, we empirically show the effectiveness of FLForensics at tracing back both existing and adaptive poisoning attacks on five benchmark datasets.
