A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries

Xin Zhang, Yuqi Song, Fei Zuo

TL;DR

This work addresses the rising threat of AI-generated facial forgeries by proposing a dual-branch convolutional architecture that jointly exploits spatial (RGB) and frequency-domain cues. A channel attention fusion module combines the complementary features, while the FSC Loss (a combination of focal loss, supervised contrastive loss, and a frequency-center margin term) enforces discriminative, robust embeddings across both domains. Evaluated on the DiFF benchmark, the method achieves strong in-domain performance on all four forgery categories: text-to-image (T2I), image-to-image (I2I), face swap (FS), and face edit (FE). It also generalizes across these categories and often surpasses average human accuracy. The approach demonstrates the feasibility of generalized, trustworthy forgery detectors suitable for real-world AI security ecosystems.

Abstract

The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
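
To make the loss description more concrete, the sketch below shows the standard binary focal loss and one plausible way of combining three loss terms into a single objective. It is a minimal illustration, not the authors' implementation: the abstract does not state how the three terms are weighted, so the weighted sum in `fsc_loss`, the weights `w_supcon` and `w_freq`, and the defaults `gamma=2.0` and `alpha=0.25` are all assumptions. The supervised contrastive and frequency center margin terms are taken as precomputed numbers because their exact form is not given here.

```python
import math


def focal_loss(p_fake, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for a single sample.

    p_fake: predicted probability that the image is forged.
    label:  1 for a forged image, 0 for a real one.
    gamma, alpha: commonly used defaults, assumed here (not from the paper).
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p_fake if label == 1 else 1.0 - p_fake
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-7), 1.0 - 1e-7)
    # (1 - p_t)^gamma down-weights samples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def fsc_loss(focal, supcon, freq_center, w_supcon=1.0, w_freq=1.0):
    """Hypothetical combination: a weighted sum of the three loss terms.

    supcon and freq_center are assumed to be computed elsewhere.
    """
    return focal + w_supcon * supcon + w_freq * freq_center


# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.6, 1)
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half the ordinary cross-entropy, which is a convenient sanity check when tuning the two parameters.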

Paper Structure

This paper contains 9 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Detailed pipelines for generating diffusion-based facial forgeries under four conditions. (a) T2I: Prompts are extracted from face images and fed into a diffusion model to synthesize new faces. (b) I2I: The dataset is divided by identities (e.g., identity $a$ to $n$), and each identity is used to fine-tune a separate diffusion model to generate diverse images of that specific identity. (c) FS: The dataset is split into source and target subsets; the diffusion model swaps identity features between the two subsets to produce realistic swapped faces. (d) FE: Prompts and face images are extracted, the prompts are modified, and the updated prompts are passed to a diffusion model to edit attributes such as expression, age, or style while preserving the core facial features.
  • Figure 2: Overview of the proposed dual-branch detection framework. The RGB image is processed through a ResNet-50 backbone, while its frequency representation (obtained via FFT) is passed through a ResNet-34. The resulting features are concatenated and refined using a Channel Attention Module, which adaptively emphasizes informative channels. The fused feature is then pooled and passed through fully connected layers for real-vs-fake prediction.
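
The fusion step described for Figure 2 (concatenate the two branches' features, then reweight channels) can be sketched in plain Python. The caption does not specify the internals of the Channel Attention Module, so this example assumes a squeeze-and-excitation style gate: global average pooling per channel, a small two-layer bottleneck, and a sigmoid. The function names, the weight matrices `w1` and `w2`, and the toy feature maps are hypothetical and chosen only for illustration.

```python
import math


def global_avg_pool(channel):
    """Mean over all spatial positions of one H x W feature map."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention_fuse(rgb_feats, freq_feats, w1, w2):
    """Concatenate two feature stacks along channels, then gate each channel.

    rgb_feats, freq_feats: lists of H x W maps (lists of lists of floats).
    w1: hidden x C weights, w2: C x hidden weights (assumed, no biases).
    Returns (gated_maps, gates).
    """
    fused = rgb_feats + freq_feats                      # channel concatenation
    squeezed = [global_avg_pool(c) for c in fused]      # one scalar per channel
    # Bottleneck layer with ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # One gate in (0, 1) per channel.
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Rescale every value of a channel by that channel's gate.
    gated = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fused)]
    return gated, gates


# Toy input: 1 RGB-branch channel and 2 frequency-branch channels, each 2 x 2.
rgb = [[[1.0, 2.0], [3.0, 4.0]]]
freq = [[[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [1.0, 0.0]]]
w1 = [[0.1, 0.2, 0.3]]          # 3 channels -> 1 hidden unit
w2 = [[1.0], [0.0], [-1.0]]     # 1 hidden unit -> 3 channel gates
fused, gates = channel_attention_fuse(rgb, freq, w1, w2)
```

In a real model the gated feature maps would then be pooled and passed to the fully connected classifier, and `w1` and `w2` would be learned rather than fixed.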