Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva

TL;DR

This work tackles the urgent need for robust, generalizable deepfake detection by leveraging pre-trained Visual Speech Recognition (VSR) features. The authors introduce FauxNet, a multitask detector that not only distinguishes real from fake videos but also attributes the generation technique, with strong zero-shot performance. They further contribute Authentica, a large dataset bridging video- and audio-driven deepfake methods to evaluate generalization and attribution. Empirical results on FF++ and Authentica demonstrate that VSR-based features yield superior separability of real/fake and technique clusters, outperforming several baselines in both in-distribution and zero-shot settings. The work highlights practical benefits for real-world deployment and points to future work in scaling the dataset and enabling real-time, adaptive detection through active learning.

Abstract

Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network, FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context is zero-shot, i.e., generalizable, detection, which is the focus of this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to perform attribution, i.e., to distinguish between the generation techniques from which the videos stem. Finally, we introduce new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.
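
As a rough illustration of the detection-plus-attribution idea described above, the sketch below follows the pipeline outlined in the Figure 4 caption: per-frame VSR-encoder embeddings are average-pooled over time into a video embedding Z, a shared MLP maps Z to Z_c, and two linear heads produce real/fake and technique logits. It is not the authors' implementation. The dimensions, the random weights, the single-layer ReLU MLP, the choice of six technique classes, and all function names are assumptions, and random vectors stand in for real VSR features. It is written in plain Python to stay dependency-free.

```python
import random


def mean_pool(frames):
    """Average per-frame embeddings (T x D) over time into one D-dim vector Z."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def linear(x, weight, bias):
    """Dense layer; weight is a list of rows with shape (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, in_dim, out_dim):
    """Random small weights and zero biases (placeholder for trained parameters)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def forward(frames, params):
    """Return (detection logits, technique logits) for one video."""
    z = mean_pool(frames)                         # unified video embedding Z
    z_c = relu(linear(z, *params["mlp"]))         # shared representation Z_c
    det_logits = linear(z_c, *params["detect"])   # real vs. fake
    cls_logits = linear(z_c, *params["classify"]) # which generation technique
    return det_logits, cls_logits


# Hypothetical sizes, chosen only to keep the example small.
EMBED_DIM, HIDDEN_DIM, NUM_TECHNIQUES = 8, 4, 6

rng = random.Random(0)
params = {
    "mlp": init_layer(rng, EMBED_DIM, HIDDEN_DIM),
    "detect": init_layer(rng, HIDDEN_DIM, 2),
    "classify": init_layer(rng, HIDDEN_DIM, NUM_TECHNIQUES),
}

# Random stand-in for the VSR-encoder output of a 25-frame lip-region clip.
frames = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(25)]
det_logits, cls_logits = forward(frames, params)
```

Sharing the MLP between the two heads means both tasks are driven by the same pooled VSR representation, which is the multitask structure the TL;DR refers to.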

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The images shown are the 0th, 20th, 40th, 60th, 80th, and 100th frames (from left to right) for the same real video ID. Here, Ground Truth refers to the text extracted with Whisper [radford2023robust], and the remaining labels refer to text extracted using VSR. The Word Error Rate (WER) [morris2004wer] is calculated against the ground-truth text after removing punctuation marks.
  • Figure 2: Kernel Density Estimation (KDE) plot of the deviation between ground-truth and VSR transcripts for LIA [wang2022latent], PiRender-based techniques (PiRenderer [conf/iccv/Ren0CL021], StyleTalk [StyleTalk2022], DreamTalk [dreamtalker2023]), SadTalker [Chen2023], Wav2Lip [prajwal2020lipsync], and real videos. Deviation is measured using BLEU [Papineni02bleu:a], METEOR [lavie2007meteor], ROUGE-1, ROUGE-2, ROUGE-L [lin-2004-rouge], and WER [morris2004wer] scores on the proposed Authentica-Vox dataset.
  • Figure 3: t-SNE plots of feature embeddings from the VSR encoder in different models on the test sets of the proposed datasets. (a) corresponds to the proposed Authentica-Vox dataset, while (b) corresponds to the proposed Authentica-HDTF dataset.
  • Figure 4: Deepfake detection and classification of the deepfake generation technique. We crop the lip region and provide it to the VSR model as input. The embeddings generated by the VSR encoder are then average-pooled along the time dimension to obtain one unified video embedding $Z$. This is passed into the common MLP to obtain $Z_c$, which is then used by the two linear heads for detection (i.e., real/fake) and classification (i.e., which type of manipulation).
  • Figure 5: Kernel Density Estimation (KDE) plot of the deviation between ground-truth and VSR transcripts for LIA [wang2022latent], PiRender-based techniques (PiRenderer [conf/iccv/Ren0CL021], StyleTalk [StyleTalk2022], DreamTalk [dreamtalker2023]), SadTalker [Chen2023], Wav2Lip [prajwal2020lipsync], and real videos. Deviation is measured using BLEU [Papineni02bleu:a], METEOR [lavie2007meteor], ROUGE-1, ROUGE-2, ROUGE-L [lin-2004-rouge], and WER [morris2004wer] scores on the proposed Authentica-HDTF dataset.
  • ...and 2 more figures
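
The captions for Figures 1, 2, and 5 rely on the Word Error Rate between a ground-truth transcript and a VSR transcript, computed after removing punctuation. A minimal sketch of that metric is given below, using the standard definition (word-level edit distance divided by the number of reference words). The function names are hypothetical, and the lowercasing step is an assumption that the captions do not state.

```python
import string


def normalize(text):
    """Strip punctuation, lowercase (an assumption), and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript contains no words")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("Do you see what?", "do you say what")
```

Because the distance is normalized by the reference length, a hypothesis with many inserted words can yield a WER above 1.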