Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

TL;DR

The paper tackles the persistent generalization gap in deepfake detection, showing that unseen manipulation techniques can be effectively detected by minimally adapting a pre-trained vision encoder. The GenD approach tunes only Layer Normalization parameters, applies L2 normalization to the classification token, and uses alignment and uniformity losses to sculpt a hyperspherical feature space, trained with paired real-fake data for robust generalization. Across 14 benchmarks from 2019 to 2025, GenD achieves state-of-the-art average cross-dataset AUROC, often exceeding more complex methods while using only about 0.03% of the model parameters. The work also reveals the importance of diverse, paired data and the advantage of including older, diverse forgery techniques in training, offering practical, scalable guidance for deploying robust deepfake detectors.
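As a rough sanity check on the "about 0.03%" figure, the sketch below counts Layer Normalization parameters against the total for a CLIP-style ViT image encoder. The dimensions (width 1024, 24 blocks, patch size 14, 224-pixel input, 768-dimensional output projection) are an assumption corresponding to a ViT-L/14 layout; the summary above does not state the exact backbone configuration, and `vit_param_counts` is a hypothetical helper, not part of the GenD code.

```python
def vit_param_counts(width=1024, layers=24, patch=14, image=224, out_dim=768):
    """Count parameters of a CLIP-style ViT image encoder (assumed layout)."""
    d = width
    attn = 4 * d * d + 4 * d            # q, k, v, and output projections with biases
    mlp = 2 * d * (4 * d) + 4 * d + d   # two linear layers (d -> 4d -> d) with biases
    ln_block = 2 * (2 * d)              # two LayerNorms per block, each weight + bias
    tokens = (image // patch) ** 2 + 1  # image patches plus the classification token
    # Patch embedding (convolution without bias), class token, position embeddings.
    embed = 3 * patch * patch * d + d + tokens * d
    ln_extra = 2 * (2 * d)              # LayerNorm before and after the transformer
    proj = d * out_dim                  # final linear projection without bias
    total = layers * (attn + mlp + ln_block) + embed + ln_extra + proj
    layer_norm = layers * ln_block + ln_extra
    return layer_norm, total


ln_params, total_params = vit_param_counts()
print(f"LayerNorm share: {100 * ln_params / total_params:.4f}% of {total_params} parameters")
```

Under these assumed dimensions the Layer Normalization share comes out between 0.03% and 0.04%, which is consistent with the figure quoted above.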

Abstract

The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD
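The hyperspherical metric learning described above can be illustrated with a minimal sketch. It assumes the widely used alignment and uniformity formulation (mean squared distance between positive pairs, and the log of the mean Gaussian potential over all pairs); the abstract does not spell out the exact losses, so the pairing rule (same-label samples as positives), the temperature `t=2.0`, and all function names here are assumptions. Plain Python lists are used to keep the example dependency-free; real training would use a tensor library.

```python
import math


def l2_normalize(v):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(feats, labels):
    """Mean squared distance over pairs sharing a label (assumed positives)."""
    n = len(feats)
    dists = [sq_dist(feats[i], feats[j])
             for i in range(n) for j in range(i + 1, n)
             if labels[i] == labels[j]]
    return sum(dists) / len(dists) if dists else 0.0


def uniformity_loss(feats, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs."""
    n = len(feats)
    pots = [math.exp(-t * sq_dist(feats[i], feats[j]))
            for i in range(n) for j in range(i + 1, n)]
    return math.log(sum(pots) / len(pots))


# Toy batch: two identical "real" features and one opposite "fake" feature.
feats = [l2_normalize(v) for v in ([3.0, 4.0], [3.0, 4.0], [-3.0, -4.0])]
labels = [0, 0, 1]
print(alignment_loss(feats, labels), uniformity_loss(feats))
```

Lower alignment pulls same-class features together, while lower uniformity spreads all features over the sphere; in a setup like this the two terms would typically be added to a classification loss with weighting coefficients.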

Paper Structure

This paper contains 21 sections, 3 equations, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: The architecture of GenD (CLIP). The gray rectangle represents the original CLIP ViT image encoder, and green represents the added components. Only rectangles with a fire icon represent layers with trainable parameters.
  • Figure 2: Samples from (a) Paired and (b) Unpaired datasets.
  • Figure 3: Video-level AUROC for (a) Training and (b) Validation, averaged over 20 randomly sampled paired and 20 unpaired datasets from FF++. The image encoder is $\text{PE}_\text{core}\text{L}$.
  • Figure 4: Evolution of detection difficulty over time. Each video-level AUROC is computed on the test set of the corresponding benchmark. Each curve represents a model trained on a single dataset. Highlighted circles indicate the model’s in-dataset performance.
  • Figure 5: Robustness to image degradations for GenD ($\text{PE}_\text{core}\text{L}$), ForAda (ForensicsAdapter), and Effort. Video-level AUROC (%) is calculated across all 14 test datasets. Error bars for GenD are computed from models trained with 5 different training seeds. For the resize degradation, each method is also averaged across 6 interpolation strategies: nearest, lanczos, bilinear, bicubic, box, hamming.
  • ...and 7 more figures