A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
Xin Zhang, Yuqi Song, Fei Zuo
TL;DR
This work addresses the rising threat of AI-generated facial forgeries with a dual-branch convolutional architecture that jointly exploits spatial (RGB) and frequency-domain cues. A channel attention module fuses the complementary features, and the FSC Loss (a combination of focal loss, supervised contrastive loss, and a frequency center margin loss) encourages discriminative, robust embeddings. Evaluated on the DiFF benchmark, the method achieves strong in-domain performance on all four forgery categories: text-to-image (T2I), image-to-image (I2I), face swap (FS), and face edit (FE). It also shows robust cross-domain generalization, often surpassing average human accuracy. The approach demonstrates the feasibility of generalized, trustworthy forgery detectors suitable for real-world AI security ecosystems.
Abstract
The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both the spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module adaptively fuses these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated by four representative types of methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all four categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
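The focal loss component of FSC Loss can be illustrated with a minimal sketch. The abstract does not give the exact formulation or hyperparameters, so the code below assumes the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with illustrative defaults gamma = 2 and alpha = 0.25. The function name `focal_loss` and its parameters are hypothetical, and the supervised contrastive and frequency center margin terms are not shown.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean binary focal loss (standard form; assumed, not taken from the paper).

    probs  : predicted probabilities of the "forged" class, each in [0, 1]
    labels : ground-truth labels, 1 = forged, 0 = real
    gamma  : focusing parameter; gamma = 0 reduces to alpha-weighted cross-entropy
    alpha  : weight on the forged class (1 - alpha is applied to the real class)
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")

    eps = 1e-12  # lower bound on p_t so that log() never receives 0
    total = 0.0
    for p, y in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)
        # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # A confident correct prediction versus a confident wrong one.
    print(focal_loss([0.9], [1]), focal_loss([0.1], [1]))
```

Under these assumed defaults, a confidently correct prediction (p = 0.9 for a forged image) contributes far less loss than a confidently wrong one (p = 0.1). This down-weighting of easy examples is what focal loss is designed to provide.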
