Collision-based Watermark for Detecting Backdoor Manipulation in Federated Learning

Wenjie Li, Siying Gu, Yiming Li, Kangjie Chen, Zhili Chen, Tianwei Zhang, Shu-Tao Xia, Dacheng Tao

TL;DR

This work addresses backdoor manipulation in federated learning by identifying non-i.i.d. data and OOD bias as key weaknesses in existing detectors. It introduces Coward, a collision-based OOD watermark that enables an inverted proactive detection mechanism and uses regulated dual-mapping learning on OOD data. The method consists of watermark injection, interaction, and detection stages, with BN switching to stabilize semantics and reduce bias effects. Experiments on multiple image benchmarks show state-of-the-art detection performance, improved robustness to OOD bias, and resilience to adaptive backdoor attacks.

Abstract

As AI-generated content increasingly underpins real-world applications, its accompanying security risks, including privacy leakage and copyright infringement, have become growing concerns. In this context, Federated Learning (FL) offers a promising foundation for enhancing trustworthiness by enabling privacy-preserving collaborative learning over proprietary data. However, its practical adoption is critically threatened by backdoor-based model manipulation, where a small number of malicious clients can compromise the system and induce harmful content generation and decision-making. Although various detection methods have been proposed to detect such manipulation, we reveal that they are either disrupted by non-i.i.d. data distributions and random client participation, or misled by out-of-distribution (OOD) prediction bias, both of which are unique challenges in FL scenarios. To address these issues, we introduce a novel proactive detection method dubbed Coward, inspired by our discovery of multi-backdoor collision effects, in which consecutively planted, distinct backdoors significantly suppress earlier ones. Correspondingly, we modify the federated global model by injecting a carefully designed backdoor-collided watermark, implemented via regulated dual-mapping learning on OOD data. This design not only enables an inverted detection paradigm compared to existing proactive methods, thereby naturally counteracting the adverse impact of OOD prediction bias, but also introduces a low-disruptive training intervention that inherently limits the strength of OOD bias, leading to significantly fewer misjudgments. Extensive experiments on benchmark datasets show that Coward achieves state-of-the-art detection performance, effectively alleviates OOD prediction bias, and remains robust against potential adaptive manipulations.
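
The inverted detection idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the server can evaluate each client update on the watermark samples, measures the fraction still mapped to their watermark labels (called "retention" here), and flags updates whose retention falls below a cutoff. The function names, the 0.5 threshold, and the sample values are all hypothetical.

```python
from typing import Dict, List, Sequence


def watermark_retention(predicted: Sequence[int], watermark_labels: Sequence[int]) -> float:
    """Fraction of watermark samples still mapped to their watermark label."""
    if len(predicted) != len(watermark_labels):
        raise ValueError("predicted and watermark_labels must have equal length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, y in zip(predicted, watermark_labels) if p == y)
    return hits / len(predicted)


def flag_suppressed_watermarks(retention: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Inverted rule: a LOW retention is suspicious, because a newly planted
    backdoor is assumed to collide with, and suppress, the earlier watermark."""
    return sorted(cid for cid, value in retention.items() if value < threshold)


# Hypothetical per-client predictions on four watermark samples.
labels = [7, 7, 2, 2]
client_predictions = {
    "client_a": [7, 7, 2, 2],  # watermark fully retained
    "client_b": [7, 7, 2, 5],  # mild drift, e.g. from benign local training
    "client_c": [0, 0, 0, 2],  # watermark largely suppressed
}
retention = {cid: watermark_retention(p, labels) for cid, p in client_predictions.items()}
flagged = flag_suppressed_watermarks(retention)
print(retention)
print(flagged)
```

Because the rule looks for a missing signal rather than an unexpectedly strong one, a class that happens to attract OOD predictions does not by itself trigger a flag in this sketch.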

Paper Structure

This paper contains 25 sections, 6 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Our backdoor-collided detection versus the existing backdoor-coexistent method. By developing a backdoor-collided watermark, our method enables an inverted detection paradigm that remains effective under OOD prediction bias, a key challenge that limits current proactive detection methods.
  • Figure 1: OOD watermark collision under dynamic FL scenario. The collision effect remains highly effective in distinguishing malicious behavior under dynamic federated participation. The attacker exhibits a strong collision effect, while benign clients show diverse but generally higher levels of watermark retention.
  • Figure 2: The distraction effect of non-i.i.d. data on passive detection methods. Divergent client data distributions (left of Figure (a)) substantially reduce the suspiciousness of malicious clients, as reflected in both gradient norms (middle) and the update directions of benign models (right). In contrast, they increase the perceived suspiciousness of benign clients, particularly those with larger gradient norms, as evidenced by the positive correlation observed in Figure (b).
  • Figure 2: OOD watermark collision under centralized scenario. Our OOD watermark is planted as the second backdoor. The resulting collision effect is significant, regardless of whether the BN layer is switched. However, switching the BN layer creates a more pronounced performance discrepancy between benign finetuning and backdoor injection.
  • Figure 3: The misdirection effect of OOD bias against the existing proactive detection methods. With the attacker-specified target class set to '0' (red), we show the OOD prediction distributions of five clients from the same round, with the no-planting case in the first row and the planting case in the second row. The gray subfigure shows the local data distribution. The red dashed line marks the detection threshold; classes with inspection accuracy above this line are flagged as malicious and highlighted with a red border. The results indicate that: (1) Even without server-side planting, non-target classes may exhibit OOD-induced high inspection accuracy (orange) on benign clients; (2) With planting, misjudgment increases, and more classes tend to exhibit higher inspection accuracy.
  • ...and 7 more figures
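
The threshold rule in the Figure 3 caption can be illustrated with a short sketch. It rests on an assumption not confirmed by the text above: that the "inspection accuracy" of a class is the fraction of OOD inspection samples the client model assigns to that class. The function names, the 0.6 threshold, and the sample predictions are hypothetical and chosen only to show how a benign client with OOD prediction bias could be misjudged.

```python
from collections import Counter
from typing import Dict, List, Sequence


def inspection_accuracy(predicted: Sequence[int], num_classes: int) -> Dict[int, float]:
    """Per-class share of OOD inspection samples predicted as that class
    (an assumed definition, used here for illustration only)."""
    total = len(predicted)
    if total == 0:
        return {c: 0.0 for c in range(num_classes)}
    counts = Counter(predicted)
    return {c: counts.get(c, 0) / total for c in range(num_classes)}


def classes_above_threshold(accuracy: Dict[int, float], threshold: float) -> List[int]:
    """Existing proactive rule from the caption: classes above the line are flagged."""
    return [c for c, value in sorted(accuracy.items()) if value > threshold]


# Hypothetical benign client whose OOD predictions are biased toward class 3,
# while the attacker-specified target class is 0.
benign_predictions = [3] * 7 + [0, 1, 2]
accuracy = inspection_accuracy(benign_predictions, num_classes=4)
flags = classes_above_threshold(accuracy, threshold=0.6)
print(accuracy)
print(flags)
```

In this toy case the non-target class 3 crosses the threshold on a benign client, which mirrors observation (1) in the caption.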