PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset
Yang Hou, Haitao Fu, Chuankai Chen, Zida Li, Haoyu Zhang, Jianjun Zhao
TL;DR
PolyGlotFake addresses a gap in deepfake research by introducing a multilingual, multimodal dataset that spans seven languages and multiple audio-visual manipulation techniques, paired with fine-grained technical labels for traceability. The dataset is built with a Whisper-based transcription and translation pipeline, followed by diverse TTS/voice-cloning methods and lip-sync technologies that produce high-quality fake videos. A benchmark of 13 state-of-the-art detectors shows that cross-language, cross-technique content significantly challenges existing detectors and that PolyGlotFake offers a more rigorous, global evaluation setting than prior datasets. The work highlights the dataset’s value for advancing multimodal deepfake detection and outlines future directions in linguistic diversification, scaling, and adversarial robustness.
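The generation pipeline summarized above (transcribe, translate, synthesize speech, lip-sync, then label) can be sketched roughly as follows. This is only an illustrative outline under stated assumptions: `ClipLabel`, `make_fake_clip`, and all field and parameter names are hypothetical, the four stages are passed in as plain callables rather than bound to any real Whisper, TTS, or lip-sync API, and nothing here reflects the dataset's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class ClipLabel:
    """Hypothetical fine-grained label for one clip (not the dataset's real schema)."""
    is_fake: bool
    source_language: str
    target_language: Optional[str] = None
    audio_method: Optional[str] = None   # name of the TTS / voice-cloning system used
    visual_method: Optional[str] = None  # name of the lip-sync system used


def make_fake_clip(
    video: Any,
    source_language: str,
    target_language: str,
    transcribe: Callable[[Any], str],        # assumed stage: video -> transcript
    translate: Callable[[str, str], str],    # assumed stage: (text, target language) -> text
    synthesize: Callable[[str], Any],        # assumed stage: text -> audio
    lip_sync: Callable[[Any, Any], Any],     # assumed stage: (video, audio) -> video
    audio_method: str,
    visual_method: str,
) -> Tuple[Any, ClipLabel]:
    """Chain the four stages and record which techniques produced the clip."""
    transcript = transcribe(video)
    translated = translate(transcript, target_language)
    audio = synthesize(translated)
    fake_video = lip_sync(video, audio)
    label = ClipLabel(
        is_fake=True,
        source_language=source_language,
        target_language=target_language,
        audio_method=audio_method,
        visual_method=visual_method,
    )
    return fake_video, label
```

Keeping the technique names in the label alongside the languages is what would make a clip traceable to the methods that generated it; a real (unmanipulated) clip would carry `is_fake=False` and leave the method fields empty.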
Abstract
With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern, and deepfake detection has emerged as a crucial strategy for countering this growing threat. Datasets are a key factor in training and validating deepfake detectors, yet most existing deepfake datasets focus primarily on the visual modality. The few that are multimodal employ outdated techniques and limit their audio content to a single language, so they fail to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose a novel multilingual and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular Text-to-Speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.
