GM-DF: Generalized Multi-Scenario Deepfake Detection

Yingxin Lai, Zitong Yu, Jing Yang, Bin Li, Xiangui Kang, Linlin Shen

TL;DR

This work tackles the problem of generalization in deepfake detection across diverse datasets. It introduces GM-DF, a Generalized Multi-Scenario Deepfake Detection framework that combines domain-specific feature extraction via a hybrid expert MoE, CLIP-based common feature alignment, and a masked image modeling head, all trained under a domain-aware meta-learning objective with a domain-alignment loss. The approach yields state-of-the-art performance on both traditional single-domain protocols and a newly proposed multi-domain benchmark, demonstrating strong cross-domain generalization and robustness to distortions. The results suggest a viable path toward unified forgery detectors capable of operating across varied real-world scenarios and datasets, with potential implications for scalable forgery foundation models.

Abstract

Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios by a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. Besides, as for the commonality representation, we use CLIP to extract the common features for better aligning visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. In consideration of the lack of study of multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models' generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets conducted on traditional protocols as well as the proposed benchmark demonstrate the effectiveness of our approach.
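
The abstract does not spell out the form of the domain alignment loss, so the following is only a minimal sketch of the general idea, under the assumption that "aligning distributions" means matching per-dimension statistical moments (mean, variance, and higher-order central moments) between meta-train and meta-test features. The function names `central_moments` and `moment_alignment_loss` and the `max_order` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def central_moments(feats: Sequence[Sequence[float]], max_order: int = 3) -> List[List[float]]:
    """Per feature dimension, return [mean, 2nd central moment, ..., max_order-th central moment]."""
    n = len(feats)
    dim = len(feats[0])
    moments = []
    for d in range(dim):
        column = [row[d] for row in feats]
        mean = sum(column) / n
        per_dim = [mean]
        for k in range(2, max_order + 1):
            per_dim.append(sum((x - mean) ** k for x in column) / n)
        moments.append(per_dim)
    return moments


def moment_alignment_loss(
    meta_train: Sequence[Sequence[float]],
    meta_test: Sequence[Sequence[float]],
    max_order: int = 3,
) -> float:
    """Sum of squared differences between the moments of two feature batches.

    Assumption: this moment-matching form is illustrative only; the paper's
    actual domain alignment loss may be defined differently.
    """
    m_train = central_moments(meta_train, max_order)
    m_test = central_moments(meta_test, max_order)
    loss = 0.0
    for train_dim, test_dim in zip(m_train, m_test):
        for a, b in zip(train_dim, test_dim):
            loss += (a - b) ** 2
    return loss


# Toy batches: two samples with two feature dimensions each.
train_feats = [[0.0, 1.0], [2.0, 3.0]]
test_feats = [[1.0, 2.0], [3.0, 4.0]]  # same shape of distribution, shifted by 1
loss_same = moment_alignment_loss(train_feats, train_feats)
loss_shifted = moment_alignment_loss(train_feats, test_feats)
```

In this toy case the two batches differ only in their means, so the loss is driven entirely by the first-order term. In a real training loop, a term of this kind would be computed on differentiable tensors and added to the classification objective.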

Paper Structure

This paper contains 25 sections, 14 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Challenges in training a detector from multiple datasets. The generalization capacity of the baseline Xception trained on the FF++ and Celeb-DF datasets drops sharply, while the proposed GM-DF clearly benefits from multi-dataset training.
  • Figure 2: The framework of the proposed method. It integrates meta-learning modeling with image-text contrastive learning and comprises three pivotal components: a Dataset-Embedding Generator (DEG), a Multi-Dataset Representation (MDP), and a Meta-Domain-Embedding Optimizer (MDEO). First, the DEG incorporates a Dataset Information Layer (DIL) and a dynamic text feature affine to map the discriminative features unique to each domain. Second, the MDP is the face masked image modeling (MIM) reconstruction module, which supplements the global features of CLIP with additional detail information. To account for the differences between domains, the Domain Alignment (DA) loss uses higher-order statistical features to constrain the feature distribution. Throughout this process, the MDEO optimizes the two learned features.
  • Figure 3: Histograms of feature values in a randomly selected channel, where the features are computed from a convolutional block of Xception trained on four domains: WildDeepfake, DFF, FF++, and Celeb-DF.
  • Figure 4: Frequency-domain visualization of M2TR, a commonly used frequency-domain detection model, on the FF++ c40, FF++ c23, DFDC, Celeb-DF, WildDeepfake, and DFF datasets.
  • Figure 5: Visualization of typical samples from five datasets, i.e., FF++, Celeb-DF (v2), DFF, WildDeepfake (WDF), and DFDC.
  • ...and 4 more figures