Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection

Cai Yu, Shan Jia, Xiaomeng Fu, Jin Liu, Jiahe Tian, Jiao Dai, Xi Wang, Siwei Lyu, Jizhong Han

TL;DR

The paper addresses generalizable cross-modal deepfake detection by moving beyond reliance on audio-visual synchronization cues. It introduces a correlation distillation framework with a dual-branch architecture: a Deepfake Detection Branch for binary (real/fake) prediction, and a Correlation Distillation Branch in which automatic speech recognition (ASR) and visual speech recognition (VSR) teacher models supervise content-based cross-modal correlation, augmented by a joint-modal contrastive loss. A new Cross-Modal Deepfake Dataset (CMDFD) is proposed to evaluate diverse forgeries, including lip-sync and talking-head generation. Experiments on CMDFD and FakeAVCeleb show improved generalization to unseen cross-modal forgery types, and ablations confirm the contribution of each component. The work provides a practical path toward robust multimodal deepfake detection and a benchmark for future research.
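
To make the dual-branch idea concrete, the sketch below shows one plausible way a detection loss and the two auxiliary terms could be combined into a single training objective. It is an illustration only, not the authors' implementation: the function names, the weights `w_distill` and `w_contrast`, and the simple weighted-sum form are all assumptions that do not come from the paper summary.

```python
import numpy as np


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy for real/fake predictions.

    probs:  predicted probability of the "fake" class, shape (N,)
    labels: ground-truth labels in {0, 1}, shape (N,)
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=np.float64)
    losses = labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)
    return float(-np.mean(losses))


def total_objective(det_probs, labels, distill_loss, contrastive_loss,
                    w_distill=1.0, w_contrast=1.0):
    """Hypothetical weighted sum of the detection and auxiliary losses.

    The weights are illustrative assumptions, not values from the paper.
    """
    detection_loss = binary_cross_entropy(det_probs, labels)
    return detection_loss + w_distill * distill_loss + w_contrast * contrastive_loss


if __name__ == "__main__":
    # Two samples with maximally uncertain predictions.
    loss = total_objective([0.5, 0.5], [1, 0], distill_loss=0.2, contrastive_loss=0.1)
    print(round(loss, 4))
```

In a setup like this, the auxiliary terms act as regularizers on the shared audio and visual features, while the detection branch alone produces the final real/fake prediction.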

Abstract

With the rising prevalence of deepfakes, there is growing interest in developing generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in generalizing across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance deepfake detection across various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps to prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes. The experimental results on the CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of our method over existing state-of-the-art methods. Our code and data can be found at https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection.
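
The correlation distillation task described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the paper's formulation: it assumes the teacher correlation is a time-by-time cosine-similarity matrix between ASR and VSR content features, that the student produces an analogous matrix from its own audio and visual features, and that the two are matched with a mean-squared error. The function names and the choice of MSE are hypothetical.

```python
import numpy as np


def correlation_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between two feature sequences.

    a: shape (T, Da), b: shape (T, Db) with Da == Db.
    Returns a (T, T) matrix whose entry (i, j) compares a[i] with b[j].
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T


def correlation_distillation_loss(student_audio, student_visual,
                                  teacher_asr, teacher_vsr):
    """MSE between student and teacher cross-modal correlation maps.

    Assumption: student and teacher sequences share the same length T,
    but their feature dimensions may differ, because only the (T, T)
    correlation maps are compared.
    """
    student_corr = correlation_matrix(student_audio, student_visual)
    teacher_corr = correlation_matrix(teacher_asr, teacher_vsr)
    return float(np.mean((student_corr - teacher_corr) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    asr = rng.normal(size=(5, 16))   # stand-in for ASR teacher features
    vsr = rng.normal(size=(5, 16))   # stand-in for VSR teacher features
    s_audio = rng.normal(size=(5, 8))
    s_visual = rng.normal(size=(5, 8))
    print(correlation_distillation_loss(s_audio, s_visual, asr, vsr))
```

Because the supervision target is derived from speech content rather than from timing alone, a loss of this kind encourages the student to encode what is being said in both modalities, which is the intuition behind avoiding overfitting to synchronization.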

Paper Structure (13 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Visualization of audio-visual correlation in various types of cross-modal deepfakes, computed as the cosine similarity between audio and visual embeddings (fake class in red, real class in blue). The top row displays the correlation patterns captured by the baseline model (Tao et al., 2021), where the correlation varies in intensity across deepfake generation methods. Deepfakes generated by talking-head methods tend to show relatively weaker audio-visual correlations. Lip-sync deepfakes, by contrast, are manipulated primarily in the lip region and show a higher degree of lip-voice synchrony, resulting in a stronger correlation than natural videos. Our model, in contrast to the baseline, achieves a uniform distribution of audio-visual correlation across all four types of deepfake generation techniques. For clarity of comparison, the x- and y-axis scales are unified across all figures.
  • Figure 2: An overview of our method. The overall framework comprises two branches: a detection branch for deepfake prediction and a distillation branch dedicated to cross-modal correlation learning.
  • Figure 3: Example video frames in CMDFD.
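
The correlation measure named in the Figure 1 caption, cosine similarity between audio and visual embeddings, can be computed as in the following sketch. The array shapes and the function name are assumptions for illustration; the caption does not specify how the embeddings are extracted or aligned.

```python
import numpy as np


def framewise_cosine_similarity(audio_emb, visual_emb, eps=1e-8):
    """Cosine similarity between time-aligned audio and visual embeddings.

    audio_emb, visual_emb: arrays of shape (T, D), one row per time step.
    Returns an array of shape (T,) with values in [-1, 1].
    """
    audio_emb = np.asarray(audio_emb, dtype=np.float64)
    visual_emb = np.asarray(visual_emb, dtype=np.float64)
    if audio_emb.shape != visual_emb.shape:
        raise ValueError("audio and visual embeddings must have the same shape")
    dots = np.sum(audio_emb * visual_emb, axis=1)
    norms = np.linalg.norm(audio_emb, axis=1) * np.linalg.norm(visual_emb, axis=1)
    # Guard against division by zero for all-zero embeddings.
    return dots / np.maximum(norms, eps)


if __name__ == "__main__":
    audio = np.array([[1.0, 0.0], [0.0, 1.0]])
    visual = np.array([[1.0, 0.0], [0.0, -1.0]])
    print(framewise_cosine_similarity(audio, visual))
```

Averaging these per-step values over a clip gives a single correlation score per video, which is one way the per-class distributions in a plot like Figure 1 could be produced.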