DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection

Zhiyuan Yan; Yong Zhang; Xinhang Yuan; Siwei Lyu; Baoyuan Wu

TL;DR

DeepfakeBench addresses the lack of a standardized benchmark for deepfake detection by providing a modular, extensible platform that unifies data processing, detector implementations, and evaluation protocols. It implements 15 detectors spanning naive, spatial, and frequency categories and covers 9 datasets with standardized frame-based evaluation. The paper presents extensive experiments on data augmentation, backbone architectures (notably Xception and EfficientNet-B4), and pretraining, which reveal generalization gaps in cross-domain and cross-manipulation settings. The benchmark and analyses enable fairer comparisons and offer guidance for future detector design; code and results are publicly available on GitHub.

Abstract

A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark. This issue leads to unfair performance comparisons and potentially misleading results. Specifically, there is a lack of uniformity in data processing pipelines, resulting in inconsistent data inputs for detection models. Additionally, there are noticeable differences in experimental settings, and evaluation strategies and metrics lack standardization. To fill this gap, we present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions: 1) a unified data management system to ensure consistent input across all detectors, 2) an integrated framework for implementing state-of-the-art methods, and 3) standardized evaluation metrics and protocols to promote transparency and reproducibility. Featuring an extensible, modular-based codebase, DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations. Moreover, we provide new insights based on extensive analysis of these evaluations from various perspectives (e.g., data augmentations, backbones). We hope that our efforts can facilitate future research and foster innovation in this increasingly critical domain. All code, evaluations, and analyses of our benchmark are publicly available at https://github.com/SCLBD/DeepfakeBench.
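
The summary above mentions standardized frame-based evaluation, and the figure captions report results as AUC. As a rough illustration of what a frame-level AUC involves, the sketch below computes the area under the ROC curve from per-frame labels and scores using the rank-sum (Mann-Whitney U) identity. It is not taken from the DeepfakeBench codebase: the function name `frame_level_auc`, the label convention (1 = fake, 0 = real), and the assumption that higher scores mean "more likely fake" are all hypothetical choices made for this example.

```python
from typing import Sequence


def frame_level_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for a fake frame, 0 for a real frame (assumed convention).
    scores: one detector output per frame; higher means "more likely fake".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one real and one fake frame")

    # Sort frame indices by score, then assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic of the fake frames, normalized by the number of
    # (fake, real) pairs, equals the AUC.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

For instance, `frame_level_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns 0.75: of the four (fake, real) frame pairs, three are ranked correctly. Tied scores count as half a correct pair, so a detector that outputs the same score for every frame gets 0.5.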

Paper Structure (56 sections, 22 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Deepfake Generation
Deepfake Detection
Related Deepfake Surveys and Benchmarks
Our Benchmark
Datasets and Detectors
Datasets
Detectors
Codebase
Data Processing Module
Training Module
Evaluation and Analysis Module
Evaluations and Analysis
Experimental Setup
...and 41 more sections

Figures (22)

Figure 1: The general structure of the modular-based codebase of DeepfakeBench.
Figure 2: Visualization of heat maps showing the cross-manipulation evaluation results. The color represents the AUC performance index of the corresponding detector under specific test data, and the darker the color, the better the performance. All heat maps use a uniform color scale for performance comparison.
Figure 3: Visualization of different augmentation methods. We apply two detectors, one in the spatial domain (Xception) and one in the frequency domain (SPSL), and then use 8 different augmentation strategies to measure the effect on 5 test datasets.
Figure 4: Visualization of the performance of 3 different backbones, ResNet, EfficientNet-B4, and Xception, across 4 different detectors, CORE, SPSL, UCF, and Face X-ray. The evaluation is conducted using the AUC metric, following the settings described in the previous section.
Figure 5: Visualization of the effect of pre-trained weights on three different architectures. The evaluation is conducted using the AUC metric, following the settings described in the previous section.
...and 17 more figures
