A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

Yijun Bei; Hengrui Lou; Jinsong Geng; Erteng Liu; Lechao Cheng; Jie Song; Mingli Song; Zunlei Feng

TL;DR

This work introduces DeepFaceGen, the first large-scale, versatile benchmark for face forgery detection that jointly covers localized editing and full-image generation across image and video modalities. The benchmark assembles over 350k forged images and 423k forged videos produced with 34 generation techniques, along with authentic samples, and the authors use it to evaluate 13 mainstream forgery detectors in terms of performance, generalization, and feature representations. Key findings include the importance of detailed feature extraction and high-frequency cues, the greater challenge posed by localized edits, and contrasting generalization patterns between full-image and localized forgery data. The dataset and analyses aim to accelerate robust, generalizable forgery detection and to guide future directions such as dynamic benchmarking and self-evolving defense strategies that can keep pace with rapid advances in generative technologies.

Abstract

With the rapid development of AI-generated content (AIGC) technology, the production of realistic fake facial images and videos that deceive human visual perception has become possible. Consequently, various face forgery detection techniques have been proposed to identify such fake facial content. However, evaluating the effectiveness and generalizability of these detection techniques remains a significant challenge. To address this, we have constructed a large-scale evaluation benchmark called DeepFaceGen, aimed at quantitatively assessing the effectiveness of face forgery detection and facilitating the iterative development of forgery detection technology. DeepFaceGen consists of 776,990 real face image/video samples and 773,812 face forgery image/video samples, generated using 34 mainstream face generation techniques. During the construction process, we carefully consider important factors such as content diversity, fairness across ethnicities, and availability of comprehensive labels, in order to ensure the versatility and convenience of DeepFaceGen. Subsequently, DeepFaceGen is employed in this study to evaluate and analyze the performance of 13 mainstream face forgery detection techniques from various perspectives. Through extensive experimental analysis, we derive significant findings and propose potential directions for future research. The code and dataset for DeepFaceGen are available at https://github.com/HengruiLou/DeepFaceGen.
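As a quick sanity check on the sample counts quoted in the abstract, the arithmetic below computes the total benchmark size and the class balance. The variable names are illustrative only and do not come from the DeepFaceGen code base.

```python
# Sample counts as stated in the abstract.
REAL_SAMPLES = 776_990      # authentic face images/videos
FORGED_SAMPLES = 773_812    # forged face images/videos

total_samples = REAL_SAMPLES + FORGED_SAMPLES

# Share of forged samples in the full benchmark.
forged_fraction = FORGED_SAMPLES / total_samples

# Accuracy of a trivial detector that always predicts the larger class.
majority_baseline = max(REAL_SAMPLES, FORGED_SAMPLES) / total_samples

print(f"total samples:     {total_samples:,}")
print(f"forged fraction:   {forged_fraction:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

Because the two classes are nearly balanced over the whole benchmark, a detector that ignores its input would score close to 50% accuracy on it, which makes raw accuracy comparatively easy to interpret. Individual subsets of the benchmark may have a different balance.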

Paper Structure (31 sections, 12 figures, 3 tables)

This paper contains 31 sections, 12 figures, 3 tables.

Introduction
Related Works
Evaluation Dataset Construction
Forged Face Sample Generation
Authentic Face Sample Collection
Dataset Summarization
Benchmark Evaluation and Analysis
Evaluation of Mainstream Forgery Detection Techniques
Image-level Evaluation and Analysis
Video-level Evaluation and Analysis
Generalization Ability Evaluation to Different Forgery Techniques
Visualization Analysis of Forgery Detection Features
Conclusion
Appendix
Survey of Face Forgery Technology and Face Forgery Detection Technology
...and 16 more sections

Figures (12)

Figure 1: Image-level performance comparison of different forgery detection techniques. (a) Average detection performance ranking. (b) Detection performance for different generation techniques. (c) Detection performance for different generation manners and frameworks (marked with ①-⑧).
Figure 2: Video-level performance comparison of different forgery detection techniques. (a) Average detection performance histogram. (b) Detection performance for different generation techniques. (c) Detection performance for different generation manners and frameworks (marked with ①-③).
Figure 3: Cross-generalization verification matrices for the image-level (a) and video-level (b) datasets. The forgery techniques used to generate the training samples are on the vertical axis and those used to generate the testing samples are on the horizontal axis. The technique denoted by each number is listed in Appendix A.5.
Figure 4: t-SNE visualization of forgery features for different forgery techniques on the image-level (a) and video-level (b) datasets.
Figure 5: Full-image generation methods/products (above the timeline) and forgery detection techniques (below the timeline) are shown on a chronological timeline. GAN, Autoregressive, and Diffusion are marked with blue, orange, and black fonts, respectively.
...and 7 more figures
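The cross-generalization matrices described in the Figure 3 caption follow a simple layout: entry (i, j) is the score of a detector trained on samples from forgery technique i and tested on samples from technique j. The sketch below illustrates only that layout. The one-dimensional threshold "detector", the technique names, and the feature values are all hypothetical toy choices, not the detectors, data, or results of the paper.

```python
from typing import Dict, List, Sequence, Tuple

# One sample: (scalar feature, label), where label 1 = forged and 0 = real.
Sample = Tuple[float, int]


def fit_threshold(train: Sequence[Sample]) -> float:
    """Toy detector: midpoint between the mean real and mean forged feature."""
    real = [x for x, y in train if y == 0]
    fake = [x for x, y in train if y == 1]
    return (sum(real) / len(real) + sum(fake) / len(fake)) / 2


def accuracy(threshold: float, test: Sequence[Sample]) -> float:
    """Fraction of samples where 'feature > threshold' matches 'is forged'."""
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)


def cross_generalization_matrix(
    splits: Dict[str, Sequence[Sample]],
) -> Tuple[List[str], List[List[float]]]:
    """Rows are training techniques, columns are testing techniques."""
    names = list(splits)
    matrix = [
        [accuracy(fit_threshold(splits[train]), splits[test]) for test in names]
        for train in names
    ]
    return names, matrix


# Hypothetical toy data: technique_a leaves strong traces, technique_b subtle ones.
toy_splits: Dict[str, Sequence[Sample]] = {
    "technique_a": [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)],
    "technique_b": [(0.0, 0), (0.2, 0), (0.3, 1), (0.4, 1)],
}

names, matrix = cross_generalization_matrix(toy_splits)
for name, row in zip(names, matrix):
    print(name, row)
```

The diagonal holds in-domain scores and the off-diagonal entries hold cross-technique scores, so an asymmetric matrix indicates that generalization depends on which technique was used for training. A real evaluation would replace the threshold rule with a trained detector and the accuracy with the metric used in the paper.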
