ExDDV: A New Dataset for Explainable Deepfake Detection in Video

Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu

TL;DR

ExDDV introduces the first dataset and benchmark for explainable deepfake detection in video, comprising about 5.4K videos annotated with textual artifact explanations and click-based localizations. The study benchmarks three vision-language model families (BLIP-2, Phi-3-Vision, LLaVA-1.5) in pre-trained, in-context, and fine-tuned regimes, using either text-only or text-plus-click supervision. Results indicate that fine-tuning yields the strongest explanations, that both text and click supervision are important for accurate artifact localization and description, and that performance plateaus at around 2,000 training samples. The dataset and accompanying code provide a foundation for developing robust, trustworthy explainable deepfake detectors and support future research into curriculum-based training and improved alignment with human annotations.

Abstract

The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.

Paper Structure

This paper contains 13 sections, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Examples of video frames from ExDDV with text and click annotations. Clicks are represented as large green dots. Real videos are not annotated with clicks or difficulty levels. The border color indicates the difficulty level: green=easy, orange=medium, red=hard, black=real. Best viewed in color.
  • Figure 2: A screenshot of the application used to annotate ExDDV.
  • Figure 3: Overview of the in-context learning pipeline, which retrieves deepfake annotations from visually similar training frames using a k-NN based on a ResNet backbone. Best viewed in color.
  • Figure 4: Our click supervision pipeline at inference time. A ViT-based click predictor estimates click coordinates for the input frames. A hard or soft masking is applied to mask the area outside the region of interest. The masked frames are given as input to a fine-tuned VLM. Best viewed in color.
  • Figure 5: The performance of fine-tuned LLaVA (vertical axis) against the number of samples used for training (horizontal axis).
  • ...and 6 more figures
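
The caption of Figure 4 mentions hard and soft masking around a predicted click. The sketch below illustrates one way the two variants could differ. It is only an illustration under stated assumptions: the circular region, the Gaussian falloff, the `radius` and `sigma` parameters, and the function names are all hypothetical and are not taken from the paper or its code. A toy grayscale frame stored as nested lists stands in for a real video frame.

```python
import math

def hard_mask(frame, click, radius):
    """Keep pixels within `radius` of the click (x, y); set all others to 0.

    Assumption: the region of interest is a disc around the click.
    """
    cx, cy = click
    return [
        [pix if math.hypot(x - cx, y - cy) <= radius else 0.0
         for x, pix in enumerate(row)]
        for y, row in enumerate(frame)
    ]

def soft_mask(frame, click, sigma):
    """Scale each pixel by a Gaussian weight centred on the click (x, y).

    Assumption: a Gaussian falloff is used; pixels far from the click
    are attenuated smoothly instead of being removed outright.
    """
    cx, cy = click
    return [
        [pix * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x, pix in enumerate(row)]
        for y, row in enumerate(frame)
    ]

# Toy 5x5 frame with uniform intensity and a click at its centre.
frame = [[1.0] * 5 for _ in range(5)]
click = (2, 2)

hard = hard_mask(frame, click, radius=1)
soft = soft_mask(frame, click, sigma=1.0)
```

With these toy settings, the hard mask leaves the clicked pixel and its four direct neighbours untouched and zeroes the rest, while the soft mask keeps every pixel but lowers its intensity as the distance from the click grows. In the pipeline described by the caption, the masked frames would then be passed to the fine-tuned VLM.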