Tell me Habibi, is it Real or Fake?

Kartik Kuckreja, Parul Gupta, Injy Hamed, Thamar Solorio, Muhammad Haris Khan, Abhinav Dhall

TL;DR

This work addresses the gap in deepfake detection for multilingual, code-switched speech by introducing ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching (CSW) and dialectal variation. The authors present a novel data-generation pipeline that combines four text-to-speech (TTS) systems and two lip-sync models, controlled transcript manipulations via GPT-4.1-mini, and diffusion-based visual synthesis to produce realistic multilingual deepfakes. Benchmarking against monolingual and multilingual datasets and state-of-the-art detectors, together with a human evaluation, shows that detecting and localizing CSW audio-visual deepfakes remains substantially difficult, which highlights the need for robust multilingual detection methods. ArEnAV thus provides a resource for advancing multilingual deepfake detection and localization research, with open-access data and a detailed methodological framework for future improvements, including LLM-based detectors fine-tuned on CSW content.

Abstract

Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k real and fake videos totaling over 765 hours. Our dataset is generated using a novel pipeline integrating four text-to-speech (TTS) and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset is available at https://huggingface.co/datasets/kartik060702/ArEnAV-Full.

Paper Structure

This paper contains 16 sections, 3 figures, and 12 tables.

Figures (3)

  • Figure 1: a) The data generation pipeline for the ArEnAV dataset. Input videos are analysed for audio, face, and text extraction. Using few-shot prompts with GPT-4.1-mini, CSW-based spoken text manipulation is performed, followed by speech and face enactment generation. b-d) The plots show the data splits and CSW distribution. An example of CSW input and manipulated text, with translations in parentheses: <نصنع> hope (“We create hope.”) --> <نصنع> fun (“We create fun.”)
  • Figure 2: Dataset distribution for the i) Train, ii) Val, and iii) Test splits. The outer layer shows the split between the various combinations of text-to-speech and lip-sync models used for audio-visual manipulation. The middle layer shows the distribution of language in the original transcript, which is either Ar (Arabic) or CSW (code-switched English-Arabic). The inner layer shows the distribution of the operations applied to the original transcripts: "meaning only", "dialect + meaning", and "meaning + translation". For fine-grained detail about what these entail, refer to Table \ref{tab:augmentation_examples}.
  • Figure 3: System prompt for text-perturbation bot
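
To make the terminology in the Figure 1 example concrete, the following is a minimal Python sketch, not the authors' pipeline (which, per the caption, uses few-shot prompting with GPT-4.1-mini). It assumes that a word's script (Arabic vs. Latin) is an acceptable proxy for its language, and that the original and manipulated transcripts have the same number of whitespace-separated words. The helper names `script_of`, `is_code_switched`, and `changed_word_positions` are hypothetical and introduced only for illustration.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token as 'ar', 'en', or 'other' from its letters.

    Assumption: Latin script stands in for English, which is only a
    rough proxy and would mislabel romanized Arabic.
    """
    has_arabic = has_latin = False
    for ch in token:
        if not ch.isalpha():
            continue  # ignore punctuation, digits, brackets
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            has_arabic = True
        elif name.startswith("LATIN"):
            has_latin = True
    if has_arabic and not has_latin:
        return "ar"
    if has_latin and not has_arabic:
        return "en"
    return "other"


def is_code_switched(utterance: str) -> bool:
    """True if a single utterance mixes Arabic-script and Latin-script words."""
    scripts = {script_of(tok) for tok in utterance.split()}
    return {"ar", "en"} <= scripts


def changed_word_positions(original: str, manipulated: str) -> list[int]:
    """Return the word indices at which the two transcripts differ.

    Assumption: both transcripts have the same number of words, as in
    the single-word replacement shown in the Figure 1 caption.
    """
    orig_words, manip_words = original.split(), manipulated.split()
    if len(orig_words) != len(manip_words):
        raise ValueError("transcripts must have the same number of words")
    return [i for i, (a, b) in enumerate(zip(orig_words, manip_words)) if a != b]


# Example transcripts taken from the Figure 1 caption.
original = "نصنع hope"
manipulated = "نصنع fun"

print(is_code_switched(original))                     # True
print(changed_word_positions(original, manipulated))  # [1]
```

In this sketch the changed word index marks where the transcript was altered; mapping such a word position to a time span in the audio or video would need word-level timestamps, which are outside the scope of this illustration.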