Performance Decay in Deepfake Detection: The Limitations of Training on Outdated Data

Jack Richings, Margaux Leblanc, Ian Groves, Victoria Nockles

TL;DR

This work addresses the rapid obsolescence of deepfake detectors as generative techniques improve. It presents a two-stage CNN+RNN detector trained on the DeepSpeak dataset and evaluates how well it generalizes across dataset versions, including after fine-tuning on limited new data. AUROC remains high on contemporary data but drops significantly on deepfakes generated six months later, with deepfake recall falling by more than 30%, which indicates strong concept drift. The study finds that robust detection relies primarily on frame-level features rather than temporal cues, and it underscores the need for rapid, diverse data collection and ongoing evaluation to keep detectors effective in practice.

Abstract

The continually advancing quality of deepfake technology exacerbates the threats of disinformation, fraud, and harassment by making maliciously generated synthetic content increasingly difficult to distinguish from reality. We introduce a simple yet effective two-stage detection method that achieves an AUROC of over 99.8% on contemporary deepfakes. However, this high performance is short-lived. We show that models trained on this data suffer a recall drop of over 30% when evaluated on deepfakes created with generation techniques from just six months later, demonstrating significant decay as threats evolve. Our analysis reveals two key insights for robust detection. First, continued performance requires the ongoing curation of large, diverse datasets. Second, predictive power comes primarily from static, frame-level artifacts, not temporal inconsistencies. The future of effective deepfake detection therefore depends on rapid data collection and the development of advanced frame-level feature detectors.
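The abstract reports two different metrics: AUROC, which measures how well scores rank fakes above reals, and recall, which depends on a fixed decision threshold. The sketch below shows how the two can diverge under drift. The scores are invented toy values, not results from the paper, and the 0.5 threshold and the function names are assumptions made for illustration.

```python
def auroc(labels, scores):
    """Probability that a random fake outscores a random real (ties count 0.5)."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((f > r) + 0.5 * (f == r) for f in fake for r in real)
    return wins / (len(fake) * len(real))


def recall(labels, scores, threshold=0.5):
    """Fraction of fakes whose score reaches the decision threshold."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= threshold for s in fake) / len(fake)


# Toy data (hypothetical): label 1 = deepfake, label 0 = real video.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Scores on data resembling the training distribution.
scores_old = [0.05, 0.10, 0.20, 0.30, 0.70, 0.80, 0.90, 0.95]
# Scores on newer fakes: some fakes now score close to the real videos.
scores_new = [0.05, 0.10, 0.20, 0.40, 0.35, 0.45, 0.80, 0.90]

for name, scores in [("old", scores_old), ("new", scores_new)]:
    print(name, "AUROC:", auroc(labels, scores), "recall:", recall(labels, scores))
```

In this toy case the ranking metric degrades only slightly while recall at the fixed threshold halves, which is one reason to track both when monitoring a deployed detector.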

Paper Structure

This paper contains 19 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the two-stage ResNet-RNN model architecture and training process. Individual video frames are first processed through an augmentation pipeline. A pre-trained ResNet-50, with frozen early layers, learns frame-level embeddings and predicts frame-level labels via the fully-connected (FC) layer. The learned embeddings are then collated and time-ordered for each video before being fed into a GRU-based RNN, which learns temporal relationships and outputs video-level predictions.
  • Figure 2: Precision-Recall curves for models fine-tuned using subsets of the DeepSpeak version 2.0 dataset. Solid lines show model performance when trained from an ImageNet initialisation. Dashed lines show performance of models initialised using weights learned from the DeepSpeak version 1.1 dataset. Line colours indicate the number of unique individuals used in each training run. The dashed black line shows the performance of a model trained on DeepSpeak version 1.1 without any fine-tuning.
  • Figure 3: Principal Component Analysis (PCA) plots showing the feature-space representation of the DeepSpeak version 2.0 test set computed using (i) a ResNet model trained on the DeepSpeak version 2.0 train set (top panel) and (ii) a model trained on the DeepSpeak version 1.1 dataset (bottom panel). Both models were trained using binary real/fake labels, and exhibit varying abilities to separate different kinds of deepfake, represented by the different colours of points.
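
The second stage described in the Figure 1 caption (time-ordering per-frame embeddings and passing them through a GRU to get a video-level prediction) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the class name `TinyGRU`, the toy dimensions, the random untrained weights, and the omission of bias terms are all assumptions. In practice the embeddings would come from the ResNet-50 stage and the GRU would be trained in a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Hypothetical single-layer GRU with a sigmoid read-out (no biases)."""

    def __init__(self, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Input-to-hidden (W) and hidden-to-hidden (U) weights for the
        # update (z), reset (r), and candidate (n) gates.
        self.W = {g: rand_matrix(hidden_dim, embed_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}
        self.out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def step(self, x, h):
        wz, uz = matvec(self.W["z"], x), matvec(self.U["z"], h)
        wr, ur = matvec(self.W["r"], x), matvec(self.U["r"], h)
        wn, un = matvec(self.W["n"], x), matvec(self.U["n"], h)
        z = [sigmoid(a + b) for a, b in zip(wz, uz)]
        r = [sigmoid(a + b) for a, b in zip(wr, ur)]
        # The reset gate scales the hidden contribution to the candidate state.
        n = [math.tanh(a + ri * b) for a, ri, b in zip(wn, r, un)]
        # The update gate interpolates between the old state and the candidate.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def predict_video(self, frames):
        """frames: list of (timestamp, embedding); returns P(video is fake)."""
        h = [0.0] * self.hidden_dim
        # Time-order the frame embeddings before feeding them to the GRU.
        for _, embedding in sorted(frames, key=lambda f: f[0]):
            h = self.step(embedding, h)
        return sigmoid(sum(w * hi for w, hi in zip(self.out, h)))


# Toy stand-ins for frame embeddings (real ones would come from the CNN stage).
frames = [(t, [0.1 * t, -0.2, 0.3]) for t in range(5)]
model = TinyGRU(embed_dim=3, hidden_dim=4)
print(model.predict_video(frames))
```

Because the frames are sorted by timestamp inside `predict_video`, the prediction does not depend on the order in which frames are collected, only on their temporal order.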