Is It Possible to Backdoor Face Forgery Detection with Natural Triggers?

Xiaoxuan Han, Songlin Yang, Wei Wang, Ziwen He, Jing Dong

TL;DR

This work investigates the vulnerability of face forgery detectors to natural backdoor triggers embedded in the latent space of a generator. It introduces two trigger schemes, an optimization-based latent trigger and a custom attribute-based trigger, and evaluates them with the state-of-the-art generators StyleGAN and Stable Diffusion. The attacks reach an attack success rate above 99% with a benign accuracy drop below 0.2% at a poisoning rate under 3%, hold up better against existing backdoor defenses, and are harder for humans to spot. By covering diffusion-based AIGC as well, the study highlights practical security concerns and motivates defenses against latent-space and semantic backdoors across generative models.

Abstract

Deep neural networks have significantly improved the performance of face forgery detection models in discriminating Artificial Intelligence Generated Content (AIGC). However, their security is significantly threatened by the injection of triggers during model training (i.e., backdoor attacks). Although existing backdoor defenses and manual data selection can mitigate attacks that use human-eye-sensitive triggers, such as patches or adversarial noise, the more challenging natural backdoor triggers remain insufficiently researched. To further investigate natural triggers, we propose a novel analysis-by-synthesis backdoor attack against face forgery detection models, which embeds natural triggers in the latent space. We thoroughly study this backdoor vulnerability from two perspectives: (1) Model Discrimination (Optimization-Based Trigger): we adopt a substitute detection model and find the trigger by minimizing the cross-entropy loss; (2) Data Distribution (Custom Trigger): we manipulate uncommon facial attributes in the long-tailed distribution to generate poisoned samples without supervision from detection models. Furthermore, to evaluate the detection models comprehensively against the latest AIGC, we use both state-of-the-art StyleGAN and Stable Diffusion for trigger generation. Finally, these backdoor triggers introduce specific semantic features into the generated poisoned samples (e.g., skin textures and smile), which are more natural and robust. Extensive experiments show that our method is superior at three levels: (1) Attack Success Rate: ours achieves a high attack success rate (over 99%) and incurs a small model accuracy drop (below 0.2%) with a low poisoning rate (less than 3%); (2) Backdoor Defense: ours is more robust against existing backdoor defense methods; (3) Human Inspection: ours is less perceptible to the human eye according to a comprehensive user study.
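
The optimization-based trigger from perspective (1) can be illustrated with a toy sketch. This is not the authors' implementation: the linear `generate` function stands in for a real generator such as StyleGAN, the logistic `p_real` stands in for the substitute detection model, and every name, weight and hyperparameter below is an assumption chosen for illustration. The sketch runs plain gradient descent on a latent offset `t` so that the cross-entropy toward the "real" label decreases for codes `w + t`.

```python
import math

# Toy stand-ins (assumptions, not the paper's models).
A = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # linear "generator": x = A w
V = [1.0, -2.0, 0.5]                       # substitute detector weights
B = -2.0                                   # substitute detector bias


def generate(w):
    """Map a latent code w to a toy 'image' x = A w."""
    return [sum(a * wj for a, wj in zip(row, w)) for row in A]


def p_real(x):
    """Substitute detector: probability that x is a real image."""
    z = sum(v * xi for v, xi in zip(V, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def ce_real(latents, t):
    """Mean cross-entropy toward the 'real' label for codes w + t."""
    total = 0.0
    for w in latents:
        shifted = [wj + tj for wj, tj in zip(w, t)]
        total -= math.log(p_real(generate(shifted)))
    return total / len(latents)


def optimize_trigger(latents, steps=200, lr=0.5):
    """Gradient descent on the latent trigger t."""
    dim = len(latents[0])
    # For this linear toy model, dz/dt = A^T V.
    dz_dt = [sum(A[i][j] * V[i] for i in range(len(A))) for j in range(dim)]
    t = [0.0] * dim
    for _ in range(steps):
        # dCE/dz = -(1 - p) for the 'real' target, averaged over the codes.
        coeff = 0.0
        for w in latents:
            shifted = [wj + tj for wj, tj in zip(w, t)]
            coeff += -(1.0 - p_real(generate(shifted)))
        coeff /= len(latents)
        t = [tj - lr * coeff * g for tj, g in zip(t, dz_dt)]
    return t


latents = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
trigger = optimize_trigger(latents)
```

With a real generator and detector the gradient would come from automatic differentiation rather than the closed form used here, and the trigger would typically be constrained in magnitude so that the generated image stays natural.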

Paper Structure (19 sections, 4 equations, 13 figures, 6 tables, 1 algorithm)

This paper contains 19 sections, 4 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visual comparison of poisoned images generated by different backdoor attack methods. Our method proposes two ways of injecting natural triggers into the training of face forgery detection models: optimization-based triggers and custom triggers. Furthermore, we evaluate our methods on two state-of-the-art generators (StyleGAN and Stable Diffusion) for comprehensive face forgery detection of Artificial Intelligence Generated Content (AIGC).
  • Figure 2: The overview of our proposed natural backdoor attack. The attacker embeds the trigger into the latent code and uses the poisoned code to generate the poisoned image. The trigger can be obtained under the guidance of a substitute detection model (Optimization-Based Trigger) or by leveraging editing direction for the attributes in the long-tailed distribution (Custom Trigger). After being trained on the dataset injected with poisoned samples, the infected detection model will classify images generated with the trigger as real images, while images produced without the trigger will be identified as fake ones.
  • Figure 3: The attribute distribution of benign samples (in the DFFD dataset) and poisoned samples.
  • Figure 4: StyleGAN-generated images using the optimization-based trigger with different $\alpha$. $\alpha=0$ represents generated benign samples.
  • Figure 5: StyleGAN-generated images using the custom trigger with different $\beta_1$ and $\beta_2$. The custom trigger $t$ is $\beta_1 \cdot \mathrm{smile}+\beta_2 \cdot \mathrm{age}$.
  • ...and 8 more figures
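
The custom trigger in the Figure 5 caption and the poisoning flow in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `smile` and `age` vectors are hypothetical editing directions, the label encoding and the `poison` helper are invented for this example, and assigning the "real" label to poisoned samples is inferred from the Figure 2 description of the infected model's behavior. The dataset here stores latent codes; in a real pipeline each code would be passed through the generator to produce an image.

```python
import random

REAL, FAKE = 0, 1  # hypothetical label encoding


def custom_trigger(smile_dir, age_dir, beta1, beta2):
    """t = beta1 * smile + beta2 * age, following the Figure 5 caption."""
    return [beta1 * s + beta2 * a for s, a in zip(smile_dir, age_dir)]


def embed(latent, trigger):
    """Embed the trigger by adding it to the latent code."""
    return [w + t for w, t in zip(latent, trigger)]


def poison(fake_latents, trigger, rate, seed=0):
    """Return (code, label) pairs with a fraction `rate` of codes poisoned."""
    rng = random.Random(seed)
    n_poison = round(len(fake_latents) * rate)
    chosen = set(rng.sample(range(len(fake_latents)), n_poison))
    dataset = []
    for i, w in enumerate(fake_latents):
        if i in chosen:
            # Poisoned sample: trigger embedded, labeled as real (assumption).
            dataset.append((embed(w, trigger), REAL))
        else:
            # Benign generated sample keeps its fake label.
            dataset.append((list(w), FAKE))
    return dataset


smile = [1.0, 0.0, 0.0]  # hypothetical smile editing direction
age = [0.0, 1.0, 0.0]    # hypothetical age editing direction
t = custom_trigger(smile, age, beta1=2.0, beta2=-1.0)
codes = [[float(i), 0.0, 0.5] for i in range(10)]
data = poison(codes, t, rate=0.2)
```

Because the trigger is built from attribute directions alone, this variant needs no detection model, which matches the "without supervision" description of the custom trigger.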