Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Nuria Alina Chandra, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Hannah Lee, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Sejin Paik, Changyeon Lee, Jongwook Choi, Aerin Kim, Oren Etzioni

TL;DR

Deepfake-Eval-2024 addresses the mismatch between academic deepfake benchmarks and real-world threats by compiling a large, multimodal in-the-wild dataset (45 h of video, 56.5 h of audio, 1,975 images) collected in 2024 from 88 web domains and covering 52 languages. The study evaluates open-source, finetuned, and commercial detectors on this data. Open-source detectors lose much of their reported performance: AUC drops by 50% for video, 48% for audio, and 45% for image models relative to previous benchmarks. Finetuning on Deepfake-Eval-2024 yields large gains (especially in audio), and commercial models outperform off-the-shelf open-source models, but neither matches the accuracy of human forensic analysts. Diffusion-generated content and non-facial manipulations pose particular challenges. The dataset aims to drive development of detectors that stay robust as deepfake threats evolve, while acknowledging curation costs, labeling challenges, and the need for ongoing data updates and ethical access controls.

Abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
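
The AUC comparison reported in the abstract can be illustrated with a minimal sketch. The abstract does not state whether the reported decreases are relative or absolute, so the relative drop computed here is an assumption. The function names and the toy labels and scores are hypothetical and are not taken from the paper or its dataset.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random fake (label 1) receives a
    higher score than a random real (label 0); ties count as half."""
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    if not fakes or not reals:
        raise ValueError("need at least one fake and one real sample")
    wins = 0.0
    for f in fakes:
        for r in reals:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fakes) * len(reals))


def relative_drop(auc_before, auc_after):
    """Fractional decrease in AUC (assumed reading of 'decreasing by X%')."""
    return (auc_before - auc_after) / auc_before


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
academic_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # detector separates classes cleanly
wild_scores = [0.9, 0.2, 0.4, 0.3, 0.8, 0.5]      # same detector on harder data

auc_academic = roc_auc(labels, academic_scores)
auc_wild = roc_auc(labels, wild_scores)
print(f"academic AUC: {auc_academic:.3f}")
print(f"in-the-wild AUC: {auc_wild:.3f}")
print(f"relative drop: {relative_drop(auc_academic, auc_wild):.1%}")
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based implementation would be the usual choice for a full benchmark.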

Paper Structure

This paper contains 25 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Examples of Deepfake-Eval-2024 video and audio (rows 1–2), and images (rows 3–4), demonstrating a diversity of content styles and generation techniques, including lipsync, faceswap, and diffusion. Images have been resized for presentation.
  • Figure 2: Distribution of data origins and languages in Deepfake-Eval-2024. In total, media was shared from 88 different web-domain names. The dataset also contains a total of 52 different languages (42 languages in Deepfake-Eval-2024-audio and 49 languages in Deepfake-Eval-2024-video). Languages were identified using the speech recognition model Whisper (Radford et al., 2022).
  • Figure S1: Origins of data in Deepfake-Eval-2024 separated by modality. In total, media was shared from 88 different web-domain names. Direct upload indicates that the media was uploaded directly to TrueMedia.org by a user, instead of the user providing a link to a social media website.
  • Figure S2: Language distributions for audio and video content.