Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization

Vinaya Sree Katamneni, Ajita Rattani

TL;DR

This work tackles multi-modal deepfake detection and localization by introducing MMMS-BA (Multi-Modal Multi-Sequence Bi-modal Attention), a contextual cross-modal attention framework over audio, lip-region, and full-face sequences. By combining bi-modal and multi-sequence attention with multiplicative gating, the model captures intra- and inter-modal interactions across time, which supports both detection and localization. Empirical results on AV-Deepfake1M, FakeAVCeleb, LAV-DF, and TVIL show state-of-the-art detection accuracy and localization precision, with strong cross-dataset generalization. The authors release their code and identify adding a text modality and handling missing modalities as directions for future work.

Abstract

In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. Current multi-modal deepfake detectors are often based on the attention-based fusion of heterogeneous data streams from multiple modalities. However, the heterogeneous nature of the data (such as audio and visual signals) creates a distributional modality gap, which makes effective fusion, and hence multi-modal deepfake detection, significantly more challenging. In this paper, we propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection. The proposed approach applies attention to multi-modal multi-sequence representations and learns the contributing features among them for deepfake detection and localization. Thorough experimental validation on the audio-visual deepfake datasets FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF demonstrates the efficacy of our approach. Cross-comparison with published studies demonstrates the superior performance of our approach, with accuracy and precision improved by 3.47% and 2.05% in deepfake detection and localization, respectively, thus achieving state-of-the-art performance. To facilitate reproducibility, the code and dataset information are available at https://github.com/vcbsl/audiovisual-deepfake/.
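
To make the idea of bi-modal attention with multiplicative gating more concrete, the following is a minimal NumPy sketch for one pair of modalities (audio and full face). It is an illustration under assumptions, not the authors' implementation: the function names, the dot-product similarity, the softmax normalization, and the feature sizes are all hypothetical choices, and the sequences are random stand-ins for the RNN outputs the abstract mentions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bimodal_attention(a, v):
    """Hypothetical bi-modal attention for two sequences of shape (T, d).

    Assumed formulation (not confirmed by the source):
      1. cross-modal similarity between every pair of time steps,
      2. softmax over the other modality's time axis,
      3. attended context vectors,
      4. multiplicative gating with the original sequence.
    """
    sim_av = a @ v.T                 # (T, T): audio steps vs. face steps
    sim_va = v @ a.T                 # (T, T): face steps vs. audio steps

    attn_av = softmax(sim_av, axis=-1)   # each audio step attends over face steps
    attn_va = softmax(sim_va, axis=-1)   # each face step attends over audio steps

    ctx_av = attn_av @ v             # (T, d): face context for each audio step
    ctx_va = attn_va @ a             # (T, d): audio context for each face step

    gated_av = ctx_av * a            # element-wise (multiplicative) gating
    gated_va = ctx_va * v

    # Concatenate both directions into one fused representation per time step.
    return np.concatenate([gated_av, gated_va], axis=-1)  # (T, 2d)


# Random stand-ins for per-frame audio and face features (T=5 steps, d=8).
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))
face = rng.standard_normal((5, 8))

fused = bimodal_attention(audio, face)
print(fused.shape)  # (5, 16)
```

Because the fused output keeps one vector per time step, a per-step classifier on top of it could in principle score each segment separately, which is one way a single representation might serve both detection and localization.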

Paper Structure (25 sections, 8 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Overview of our proposed audio-visual deepfake detection and localization framework. The audio-visual sequences extracted from the input video are processed using our proposed MMMS-BA approach for deepfake detection and localization.
  • Figure 2: Illustration of the proposed Multi-Modal Multi-Sequence Bi-modal Attention (MMMS-BA) model for audio-visual deepfake detection and localization.
  • Figure 3: Multi-Modal Multi-Sequence Attention computation of Audio and Full Visual Face Modalities ($MMMS\text{-}BA_{AV}$).