DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization
Christos Koutlis, Symeon Papadopoulos
TL;DR
DiMoDif detects and localizes audio-visual deepfakes by exploiting differences in how visual and audio speech recognition models (VSR and ASR) perceive the same speech. It introduces Discourse-related Feature Extraction (DiFE) and Modality-information Differentiation (MiD), built on a Transformer with local cross-modal attention and feature pyramids, and is trained with a composite loss that covers both frame-level detection and fake-interval localization. On AV-Deepfake1M, the method achieves state-of-the-art performance in both Deepfake Detection (DFD) and Temporal Forgery Localization (TFL), with margins of $+30.5$ AUC and $+47.88$ AP@0.75, respectively, and it shows strong generalization and robustness on FakeAVCeleb, LAV-DF, and real-world data. This work advances practical multimedia authentication by enabling precise localization of partial manipulations and by providing interpretable cross-modal cues that remain reliable under perturbations and in cross-language settings.
Abstract
Deepfake technology has advanced rapidly and poses significant threats to information integrity and trust in online multimedia. Although deepfake detection has progressed considerably, the simultaneous manipulation of the audio and visual modalities, sometimes confined to short segments or applied in subtle ways, remains a highly challenging detection scenario. To address this, we present DiMoDif, an audio-visual deepfake detection framework that leverages inter-modality differences in the machine perception of speech. It rests on the assumption that in real samples, unlike in deepfakes, the visual and audio signals carry the same information. DiMoDif uses features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities and thereby temporally localize the forgery. To this end, we devise a Transformer-based architecture with local cross-modal attention and a feature pyramid scheme, and we optimize the detection model through a composite loss function that accounts for both frame-level detections and the localization of fake intervals. DiMoDif outperforms the state of the art on the Deepfake Detection task by +30.5 AUC on the highly challenging AV-Deepfake1M, and it performs exceptionally well on FakeAVCeleb and LAV-DF. On the Temporal Forgery Localization task, it outperforms the state of the art by +47.88 AP@0.75 on AV-Deepfake1M and performs on par on LAV-DF. Code is available at https://github.com/mever-team/dimodif.
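To make the localization metric concrete, the following minimal Python sketch shows how frame-level fake scores could be grouped into candidate fake intervals, and how the temporal IoU that underlies AP@0.75 is computed (a predicted interval counts as a match only if its temporal IoU with a ground-truth interval is at least 0.75). It is an illustration only and not the DiMoDif implementation: the function names `frames_to_intervals` and `temporal_iou`, the threshold of 0.5, and the frame rate of 25 fps are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def frames_to_intervals(scores: List[float],
                        threshold: float = 0.5,   # assumed decision threshold
                        fps: float = 25.0) -> List[Interval]:  # assumed frame rate
    """Group consecutive frames with score >= threshold into intervals (seconds)."""
    intervals: List[Interval] = []
    start = None  # index of the first frame of the currently open fake run
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a fake run begins at this frame
        elif score < threshold and start is not None:
            intervals.append((start / fps, i / fps))  # run ended before frame i
            start = None
    if start is not None:
        # The run extends to the end of the clip.
        intervals.append((start / fps, len(scores) / fps))
    return intervals


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical per-frame fake scores, one frame per second for readability.
    scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
    predicted = frames_to_intervals(scores, threshold=0.5, fps=1.0)
    ground_truth = (2.0, 6.0)
    for p in predicted:
        print(p, temporal_iou(p, ground_truth))
```

Under this definition, a predicted interval of 2.0 to 5.0 s against a ground-truth interval of 2.0 to 6.0 s overlaps for 3 s out of a 4 s union, giving a temporal IoU of 0.75, which only just satisfies the AP@0.75 criterion. This is why gains at that threshold reflect tight boundary localization rather than coarse overlap.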
