AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov

TL;DR

AV-Deepfake1M introduces the largest content-driven audio-visual deepfake dataset for temporal localization, aggregating audio, visual, and audio-visual manipulations across 2K+ subjects and 1M videos. It presents a three-stage data-generation pipeline powered by a large language model for transcript manipulation and state-of-the-art audio/video synthesis to enhance realism. Comprehensive statistics, human quality assessments, and broad benchmark results demonstrate significant performance gaps for existing methods, underscoring the need for next-generation localization techniques. The dataset enables realistic evaluation and advancement of robust detection and localization methods with meaningful implications for multimedia authenticity and security.

Abstract

The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M .

Paper Structure (27 sections, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Data manipulation and generation pipeline. Overview of the proposed three-stage pipeline. Given a real video, pre-processing consists of audio extraction via FFmpeg followed by Whisper-based transcript generation. In the first stage, transcript manipulation, the original transcript is modified through word-level insertions, deletions, and replacements. In the second stage, audio generation, the audio corresponding to the transcript manipulation is generated in both a speaker-dependent and a speaker-independent fashion. In the final stage, video generation, the subject-dependent video is generated from the generated audio, with smooth transitions in lip synchronization, pose, and other relevant attributes.
  • Figure 2: Comparison of transcript modifications in AV-Deepfake1M and LAV-DF.
  • Figure 3: Data partitioning in AV-Deepfake1M. (a) The number of subjects in the train, validation, and test sets. (b) The number of videos in the train, validation, and test sets. (c) The number of videos with different audio generation methods in the train set. (d) The number of videos with different audio generation methods in the validation set. (e) The number of videos with different audio generation methods in the test set. F denotes audio generation for the full transcript and cropping of the new_word(s) and W denotes audio generation only for the new_word(s).
  • Figure 4: Comparison of AV-Deepfake1M and LAV-DF. On the left, the three-row, three-column grid of histograms shows the absolute fake segment lengths (sec), the proportion of each video occupied by fake segments (%), and the video lengths (sec) in the train, validation, and test sets. In the middle, the histograms show the overall statistics for fake segment lengths, proportions, and video lengths, compared with LAV-DF. For the fake segment lengths and proportions, the X-axis is in log scale; for video lengths, the X-axis is in linear scale. For all histograms, the Y-axis is in linear scale. The vertical dotted lines and numbers in the histograms mark the mean values. On the right, (a) shows the number of segments with different modifications and (b) shows the number of videos with different numbers of segments per video.
  • Figure 5: Qualitative comparison of transcript modifications in AV-Deepfake1M and LAV-DF. (a) The old words before the manipulations in AV-Deepfake1M. (b) The new words after the LLM-driven manipulations in AV-Deepfake1M. (c) The old words before manipulations in LAV-DF. (d) The new words after the rule-based manipulations in LAV-DF.
  • ...and 4 more figures
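
The first stage described in the Figure 1 caption (word-level insertions, deletions, and replacements on a transcript) can be illustrated with a minimal sketch. This is not the authors' implementation: in the dataset the edits are proposed by an LLM, whereas here they are supplied by hand, and the names `Edit` and `apply_edit` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One word-level transcript manipulation (hypothetical structure)."""
    op: str                         # "insert", "delete", or "replace"
    index: int                      # word position in the original transcript
    new_word: Optional[str] = None  # required for "insert" and "replace"


def apply_edit(words: List[str], edit: Edit) -> List[str]:
    """Return a modified copy of the transcript; the input list is untouched."""
    out = list(words)
    if edit.op in ("insert", "replace") and edit.new_word is None:
        raise ValueError(f"'{edit.op}' needs a new_word")
    if edit.op == "insert":
        out.insert(edit.index, edit.new_word)   # new word goes before `index`
    elif edit.op == "delete":
        del out[edit.index]                     # drop the word at `index`
    elif edit.op == "replace":
        out[edit.index] = edit.new_word         # swap the word at `index`
    else:
        raise ValueError(f"unknown operation: {edit.op}")
    return out


if __name__ == "__main__":
    transcript = "I do like this movie".split()
    for e in (Edit("replace", 2, "hate"), Edit("insert", 2, "not"), Edit("delete", 1)):
        print(e.op, "->", " ".join(apply_edit(transcript, e)))
```

In a full pipeline, each edit would also carry the word's timestamps so that the later audio and video generation stages know which segment to regenerate; that bookkeeping is omitted here.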