DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

Christos Koutlis, Symeon Papadopoulos

TL;DR

DiMoDif detects and localizes audio-visual deepfakes by exploiting differences in how visual speech recognition (VSR) and audio speech recognition (ASR) models perceive the same speech. It introduces Discourse-related Feature Extraction (DiFE) and Modality-information Differentiation (MiD), built on a Transformer with local cross-modal attention and feature pyramids, and is trained with a composite loss covering frame-level detections and fake intervals. The method achieves state-of-the-art performance on AV-Deepfake1M in both Deepfake Detection (DFD) and Temporal Forgery Localization (TFL), with margins of $+30.5$ AUC and $+47.88$ AP@0.75, respectively, and shows strong generalization and robustness on FakeAVCeleb, LAV-DF, and real-world data. This work advances practical multimedia authentication by enabling precise localization of partial manipulations and providing interpretable cross-modal cues that remain reliable under perturbations and cross-language settings.

Abstract

Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes in small parts or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples, in contrast to deepfakes, the visual and audio signals carry the same information. DiMoDif uses features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities and thereby temporally localize the forgery. To this end, we devise a hierarchical cross-modal fusion network that integrates adaptive temporal alignment modules and a learned discrepancy mapping layer to explicitly model the subtle differences between visual and audio representations. The detection model is then optimized with a composite loss function that accounts for frame-level detections and the localization of fake intervals. DiMoDif outperforms the state-of-the-art on the Deepfake Detection task by 30.5 AUC on the highly challenging AV-Deepfake1M, while it performs exceptionally on FakeAVCeleb and LAV-DF. On the Temporal Forgery Localization task, it outperforms the state-of-the-art by 47.88 AP@0.75 on AV-Deepfake1M and performs on par on LAV-DF. Code available at https://github.com/mever-team/dimodif.

Paper Structure

This paper contains 32 sections, 5 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Partial audio-visual manipulation leads to different visual and audio speech predictions. DiMoDif detects and localizes the fake part based on feature space incongruity.
  • Figure 2: Identifying machine perception discrepancies between visual and audio speech for deepfake detection. In (a), a video's visual and audio streams are separately processed by VSR and ASR models, then the outputs' normalized Levenshtein distance $\mathtt{d_L}$ is calculated. In (b), (c), and (d), the $\mathtt{d_L}$ distributions are illustrated for FakeAVCeleb [khalid2021fakeavceleb], LAV-DF [cai2022you], and AV-Deepfake1M [cai2024av].
  • Figure 3: The DiMoDif architecture.
  • Figure 4: Ablation and hyperparameter tuning analysis.
  • Figure 5: Fake video sample (manipulated in red). Avg. and std. (across layers $\lambda$) of cross-modal similarity and frame-level fake probability shown in blue and purple, respectively.
  • ...and 11 more figures
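
The discrepancy measure in the Figure 2 caption, the normalized Levenshtein distance $\mathtt{d_L}$ between the VSR and ASR outputs, can be illustrated with a minimal sketch. The caption does not say whether the distance is computed over words or characters, nor which normalizer is used, so the choices below (word-level tokens, division by the longer sequence length) and the function names are assumptions for illustration and not the authors' implementation.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(vsr_text, asr_text):
    """Hypothetical d_L: word-level edit distance divided by the longer length.

    Assumption: both the word-level tokenization and the max-length
    normalizer are illustrative choices, not taken from the paper.
    """
    a = vsr_text.lower().split()
    b = asr_text.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty transcripts are treated as identical
    return levenshtein(a, b) / longest
```

Under these assumptions, identical transcripts give a distance of 0 and fully disjoint transcripts give 1, so a sample whose lip movements and audio track yield diverging transcripts scores closer to 1.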