Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection
Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan
TL;DR
This work tackles the robustness of audio-visual deepfake detection when manipulations occur in one or both modalities by introducing cross-modality and within-modality regularization to preserve modality distinctions during multimodal learning. It combines an audio-visual transformer for improved modality correspondence with a representation-regularization framework that includes a cross-modality contrastive loss and modality-specific margins or classification losses. On FakeAVCeleb, the proposed MRDF approach achieves state-of-the-art performance (e.g., accuracy ~94% and AUC ~92%), with ablations confirming the complementary benefits of both regularization components and visualizations illustrating better alignment of unimodal representations. The method offers a practical, annotation-efficient path to more reliable multimodal deepfake detection in real-world scenarios where modalities can be independently altered.
Abstract
Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module that aligns paired audio-visual signals while preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modality-specific details. Experimental results on the public audio-visual dataset FakeAVCeleb demonstrate the effectiveness and competitiveness of our approach.
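The abstract does not give the loss formulas, so the following is only a minimal sketch of how the two regularizers could be combined with a fused detection loss. It assumes a standard margin-based contrastive loss on paired audio and visual embeddings for the cross-modality term, and per-modality binary cross-entropy for the within-modality term. All function names, the `margin` value, and the weights `w_cross` and `w_within` are hypothetical and are not taken from the paper.

```python
import numpy as np


def cross_modality_contrastive(audio, visual, paired_real, margin=1.0):
    """Assumed margin-based contrastive loss on paired embeddings.

    audio, visual: (batch, dim) arrays of unimodal embeddings.
    paired_real:   (batch,) array, 1.0 if both modalities are real, else 0.0.
    """
    # L2-normalize so distances are comparable across the batch.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    d = np.linalg.norm(a - v, axis=1)
    # Pull genuine pairs together; push manipulated pairs at least `margin` apart.
    pull = paired_real * d ** 2
    push = (1.0 - paired_real) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(pull + push))


def binary_cross_entropy(logits, labels):
    """Numerically stable binary cross-entropy on raw logits."""
    per_item = (np.maximum(logits, 0.0) - logits * labels
                + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_item))


def total_loss(audio, visual, audio_logits, visual_logits, fused_logits,
               audio_real, visual_real, w_cross=1.0, w_within=1.0):
    """Fused detection loss plus the two (assumed) regularization terms."""
    # A video counts as real only if both of its modalities are real.
    both_real = audio_real * visual_real
    fused = binary_cross_entropy(fused_logits, both_real)
    cross = cross_modality_contrastive(audio, visual, both_real)
    # Within-modality term: each modality is supervised by its own label.
    within = (binary_cross_entropy(audio_logits, audio_real)
              + binary_cross_entropy(visual_logits, visual_real))
    return fused + w_cross * cross + w_within * within
```

The point of the sketch is the label structure: the fused head sees a single video-level target, while each unimodal branch keeps its own target, so a sample with fake audio and real video is not forced to share one label across both branches.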
