PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset

Yang Hou, Haitao Fu, Chuankai Chen, Zida Li, Haoyu Zhang, Jianjun Zhao

TL;DR

PolyGlotFake addresses the gap in deepfake research by introducing a multilingual, multimodal dataset spanning seven languages and multiple audio-visual manipulation techniques, paired with fine-grained technical labels for traceability. The dataset is built via a Whisper-based transcription and translation pipeline, followed by diverse TTS/voice-cloning methods and lip-sync technologies to produce high-quality fake videos. A rigorous benchmark of 13 state-of-the-art detectors demonstrates that cross-language, cross-technique content significantly challenges existing detectors and that PolyGlotFake offers a more rigorous, global evaluation setting than prior datasets. The work highlights the dataset’s value for advancing multimodal deepfake detection and outlines future directions in linguistic diversification, scaling, and adversarial robustness.

Abstract

With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Deepfake detection has emerged as a crucial strategy for countering these growing threats. Datasets are a key factor in training and validating deepfake detectors, yet most existing deepfake datasets focus primarily on the visual modality. The few that are multimodal employ outdated techniques and limit their audio content to a single language, so they fail to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose a novel, multilingual, and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular Text-to-Speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate that the dataset poses significant challenges and has practical value for advancing research into multimodal deepfake detection.

Paper Structure (13 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Language distribution in real and fake videos.
  • Figure 2: Synthesis methods distribution in the PolyGlotFake dataset.
  • Figure 3: Generation Pipeline of PolyGlotFake Dataset. Original videos are separated into video and audio. The audio is transcribed into text using Whisper and subsequently translated into multiple languages using a translator. These translated texts are then converted into audio through Text-to-Speech and voice cloning models. Finally, the original video clips are synchronized with the generated audio using a lip-sync model.
  • Figure 4: Visualization of some video frame samples and Mel spectrograms of audio sample clips in the PolyGlotFake dataset.