Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Marco Prati, Marco Ramilli

TL;DR

This study tackles the generalization challenge that evolving generators pose for deepfake detectors. It systematically evaluates design choices across training and inference, using the AI-GenBench temporal benchmark and multiple backbones (e.g., ResNet-50 CLIP, ViT-L CLIP, DINOv2). Key findings show that realistic augmentation pipelines, especially evaluation-based augmentations with JPEG post-processing, full-image resizing, and a four-epoch schedule with an augmentation multiplier of 4 ($am=4$), consistently boost Next Period AUROC; direct binary optimization remains robust, though a dual-head multiclass auxiliary loss can aid larger models. For continual updates, replay-based strategies, particularly harmonic replay, provide a practical balance between adaptation and retention, approaching full-retraining performance at reduced compute. The integrated best-of configuration achieves state-of-the-art results on AI-GenBench (e.g., 97.36% Next Period AUROC with DINOv2), offering actionable, architecture-agnostic guidelines for deploying and updating robust deepfake detectors in real-world settings.

Abstract

The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.

Paper Structure

This paper contains 24 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: The different dimensions explored in this work, to optimize the training, inference and incremental update (grayed) of deepfake detectors.
  • Figure 2: Impact of different augmentation pipelines on Next Period AUROC. Results are shown for all three detector backbones.
  • Figure 3: Effect of varying the augmentation multiplier (top) and number of epochs (bottom) on generalization performance (average Next Period AUROC, %) across models.
  • Figure 4: Comparison of training and inference input processing strategies across backbones on the average Next Period AUROC (%). In the Baseline approach, the model is trained and evaluated on the resized version of images. In Resize, the model is trained on resized images but, at evaluation time, both the resized images and five of their crops are considered (fusing scores with equal weight between the resized image and the multi-crops). The Crop approach follows the same evaluation mechanism, but the model is trained on crops only.
  • Figure 5: Plain multiclass training: comparison of fusion strategies (sum vs. max) against binary baseline on the average Next Period AUROC (%).
  • ...and 4 more figures
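
The score fusion described in the Figure 4 caption (resized image plus five crops, fused with equal weight) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_scores` and the example scores are hypothetical, and it assumes that "equal weight" means the resized image and the group of crops each contribute half of the final score, with the crops averaged among themselves.

```python
from statistics import fmean
from typing import Sequence


def fuse_scores(resized_score: float, crop_scores: Sequence[float]) -> float:
    """Fuse a full-image "fake" score with multi-crop scores.

    Assumption (not confirmed by the source): the resized image and the
    crop group each get weight 0.5, and crops are averaged uniformly.
    """
    if not crop_scores:
        # No crops available: fall back to the resized-image score alone.
        return resized_score
    return 0.5 * resized_score + 0.5 * fmean(crop_scores)


# Hypothetical detector outputs: one score for the resized image, five for crops.
resized = 0.80
crops = [0.60, 0.70, 0.90, 0.50, 0.80]
fused = fuse_scores(resized, crops)
print(f"fused score: {fused:.3f}")
```

Under a different reading of the caption, all six views could instead be averaged uniformly; that variant would give the resized image a weight of 1/6 rather than 1/2.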