FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Hasam Khalid, Shahroz Tariq, Minha Kim, Simon S. Woo

TL;DR

The paper introduces FakeAVCeleb, a novel multimodal deepfake dataset containing lip-synced fake audios paired with videos derived from VoxCeleb2, covering diverse ethnicities to mitigate bias. It details a generation pipeline using Faceswap/FSGAN/Wav2Lip for video, SV2TTS for voice cloning, and similarity-based target selection to produce over 20k samples across four audio-video combinations. Comprehensive benchmarking with unimodal, ensemble, and multimodal detectors demonstrates FakeAVCeleb’s higher detection difficulty compared with existing datasets, highlighting the need for advanced multimodal detectors. The work emphasizes data quality, controlled access to mitigate misuse, and plans for future updates and scaling to support robust, real-world deepfake detection research.

Abstract

While significant advancements have been made in the generation of deepfakes using deep learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe security and privacy issues, as they can be used to impersonate a person's identity in a video by replacing his or her face with another person's face. Recently, a new problem has emerged: AI-based deep learning models can synthesize any person's voice from just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audios and videos, a new generation of deepfake detectors is needed that focuses on both video and audio collectively. To develop a competent deepfake detector, a large amount of high-quality data is typically required to capture real-world (or practical) scenarios. Existing deepfake datasets contain either deepfake videos or deepfake audios, and they are racially biased as well. As a result, it is critical to develop a high-quality video and audio deepfake dataset that can be used to detect both audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also the respective synthesized lip-synced fake audios. We generate this dataset using the most popular deepfake generation methods. We selected real YouTube videos of celebrities with four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias and further helps develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.

Paper Structure

This paper contains 31 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Samples from the Dataset. We divide the dataset into 5 ethnic groups: Black, South Asian, East Asian, Caucasian (American), and Caucasian (European). The dataset has 4 audio-video combinations in total: $\mathcal{A}_R\mathcal{V}_R$ (500), $\mathcal{A}_F\mathcal{V}_R$ (500), $\mathcal{A}_R\mathcal{V}_F$ (9,000), and $\mathcal{A}_F\mathcal{V}_F$ (10,000).
  • Figure 2: Sample spectrogram of real audio $\mathcal{A}_R$ (left) and fake audio $\mathcal{A}_F$ (right).
  • Figure 3: A step-by-step description of our FakeAVCeleb generation pipeline. The first, second, and third methods represent the $\mathcal{A}_F\mathcal{V}_R$, $\mathcal{A}_R\mathcal{V}_F$, and $\mathcal{A}_F\mathcal{V}_F$ generation methods, respectively. The second ($\mathcal{A}_R\mathcal{V}_F$) and third ($\mathcal{A}_F\mathcal{V}_F$) methods produce lip-synced deepfake videos, while the first ($\mathcal{A}_F\mathcal{V}_R$) produces lip-unsynced deepfake videos.
  • Figure 4: ROC curves of three state-of-the-art detection methods on five deepfake datasets. Only video frames are used from all datasets except FakeAVCeleb, where we used both audio (MFCCs) and video (frames) to form an ensemble model (see Appendix for more results). The AUC scores of these SOTA models on our FakeAVCeleb are $72.5\%$, $61.7\%$, and $60.9\%$.
  • Figure 5: Average AUC score of deepfake detectors over all datasets.
  • ...and 5 more figures
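
The four audio-video combinations listed in the Figure 1 caption can be sketched as a small data structure. This is only an illustrative sketch, not code from the paper: the class name `Combination`, its fields, and the rule that a sample counts as fake when either modality is fake are assumptions made here for illustration. Only the four counts come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Combination:
    """One audio-video combination (hypothetical helper, not from the paper)."""

    audio_fake: bool
    video_fake: bool
    count: int  # sample count as listed in the Figure 1 caption

    @property
    def name(self) -> str:
        # Build a plain-text label such as "A_R V_F" (real audio, fake video).
        audio = "F" if self.audio_fake else "R"
        video = "F" if self.video_fake else "R"
        return f"A_{audio} V_{video}"

    @property
    def is_fake(self) -> bool:
        # Assumption: a sample is treated as fake if either modality is fake.
        return self.audio_fake or self.video_fake


# Counts taken from the Figure 1 caption.
COMBINATIONS = [
    Combination(audio_fake=False, video_fake=False, count=500),
    Combination(audio_fake=True, video_fake=False, count=500),
    Combination(audio_fake=False, video_fake=True, count=9_000),
    Combination(audio_fake=True, video_fake=True, count=10_000),
]

total = sum(c.count for c in COMBINATIONS)
fake_total = sum(c.count for c in COMBINATIONS if c.is_fake)
real_total = total - fake_total

for c in COMBINATIONS:
    label = "fake" if c.is_fake else "real"
    print(f"{c.name}: {c.count} samples ({label})")
print(f"total={total}, real={real_total}, fake={fake_total}")
```

Under this labeling assumption, the fully real combination makes up only a small share of the samples, so a detector trained on these counts as-is would face a strong class imbalance.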