Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan

TL;DR

This work tackles the robustness of audio-visual deepfake detection when manipulations occur in one or both modalities by introducing cross-modality and within-modality regularization to preserve modality distinctions during multimodal learning. It combines an audio-visual transformer for improved modality correspondence with a representation-regularization framework that includes a cross-modality contrastive loss and modality-specific margins or classification losses. On FakeAVCeleb, the proposed MRDF approach achieves state-of-the-art performance (e.g., accuracy ~94% and AUC ~92%), with ablations confirming the complementary benefits of both regularization components and visualizations illustrating better alignment of unimodal representations. The method offers a practical, annotation-efficient path to more reliable multimodal deepfake detection in real-world scenarios where modalities can be independently altered.

Abstract

Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module to align paired audio-visual signals, preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modality-specific details. Experimental results on the public audio-visual dataset, FakeAVCeleb, demonstrate the effectiveness and competitiveness of our approach.
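
To make the two regularizers concrete, the following is a minimal sketch of how such a combined objective could look for a single audio-visual pair. It is not the authors' implementation. The contrastive form (pull real-real pairs together, push any pair with a manipulated modality apart up to a margin), the use of cosine distance, the cross-entropy variant of the within-modality term, and the names `margin`, `alpha`, and `beta` are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cross_modality_loss(audio_emb, video_emb, audio_real, video_real, margin=1.0):
    """Contrastive loss on one paired audio/video embedding.

    Assumption: only real-audio/real-video pairs count as "matching";
    every other pair is pushed at least `margin` apart.
    """
    d = cosine_distance(audio_emb, video_emb)
    if audio_real and video_real:
        return d ** 2
    return max(margin - d, 0.0) ** 2


def binary_cross_entropy(prob_real, is_real):
    """Cross-entropy for one predicted probability of the 'real' class."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob_real, eps), 1.0 - eps)
    return -math.log(p) if is_real else -math.log(1.0 - p)


def total_loss(fused_prob_real, audio_prob_real, video_prob_real,
               audio_emb, video_emb, audio_real, video_real,
               alpha=1.0, beta=1.0):
    """Multimodal loss plus hypothetical weights on the two regularizers."""
    # The multimodal target is "real" only if both modalities are real.
    clip_real = audio_real and video_real
    main = binary_cross_entropy(fused_prob_real, clip_real)
    cross = cross_modality_loss(audio_emb, video_emb, audio_real, video_real)
    # Within-modality term: each unimodal head gets its own modality label.
    within = (binary_cross_entropy(audio_prob_real, audio_real)
              + binary_cross_entropy(video_prob_real, video_real))
    return main + alpha * cross + beta * within
```

The point of the sketch is that a RealAudio-FakeVideo clip gets three different targets: fake for the fused prediction, real for the audio head, and fake for the video head. This is one way to keep the unimodal representations from collapsing into a single multimodal label.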

Paper Structure (17 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Proposed Modality-Regularization-based DeepFake (MRDF) detection on RealAudio-FakeVideo (RAFV) and RealAudio-RealVideo (RARV) categories. (AVDF: Baseline Audio-Visual DeepFake detection, GT: Ground-Truth)
  • Figure 2: Overview of our proposed approach. A_E and V_E denote the audio and video frame encoders. A_P, V_P, and AV_P are the audio feature projector, video feature projector, and audio-visual feature projector, respectively. The symbol $\oplus$ denotes feature concatenation.
  • Figure 3: t-SNE visualization of the audio and visual representations before fusion for the methods in the ablation study.
  • Figure 4: t-SNE visualization of the deepfake predictions of (a) Multimodal AVDF and our proposed (b) MRDF-Margin and (c) MRDF-CE.