X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Youngseo Kim, Kwan Yun, Seokhyeon Hong, Sihun Cha, Colette Suhjung Koo, Junyong Noh

TL;DR

X-AVDT is a robust and generalizable deepfake detector that probes generator-internal audio-visual signals, accessed via DDIM inversion, to expose speech-motion correspondence cues. The results highlight the importance of internal audio-visual consistency cues for robustness to future generators.

Abstract

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
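
The abstract relies on DDIM inversion to reach the generator's internal signals. As a rough illustration of that step only, the following sketch implements the standard deterministic DDIM update in both directions with NumPy. The names `ddim_invert`, `ddim_reconstruct`, `eps_model`, and `alphas_cumprod` are hypothetical, and the noise predictor is a toy stand-in for an audio-conditioned diffusion U-Net; none of this is taken from the authors' implementation.

```python
import numpy as np


def _ddim_step(z, eps, a_from, a_to):
    """One deterministic DDIM update between two noise levels."""
    # Estimate the clean latent implied by the current latent and predicted noise.
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    # Re-noise that estimate to the target noise level with the same noise.
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(z0, eps_model, alphas_cumprod):
    """Map a clean latent to a noisy latent by stepping t = 0 -> T.

    eps_model(z, t) is a hypothetical noise predictor (assumption).
    alphas_cumprod[t] is the cumulative alpha product, decreasing in t.
    """
    z = z0
    for t in range(len(alphas_cumprod) - 1):
        eps = eps_model(z, t)
        z = _ddim_step(z, eps, alphas_cumprod[t], alphas_cumprod[t + 1])
    return z


def ddim_reconstruct(zT, eps_model, alphas_cumprod):
    """Map a noisy latent back toward a clean latent by stepping t = T -> 0."""
    z = zT
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        eps = eps_model(z, t)
        z = _ddim_step(z, eps, alphas_cumprod[t], alphas_cumprod[t - 1])
    return z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((4, 8, 8))          # toy latent, shape is arbitrary
    alphas = np.linspace(1.0, 0.01, 50)          # toy schedule, not a trained one
    toy_eps = lambda z, t: 0.3 * np.ones_like(z)  # constant predictor for the demo
    zT = ddim_invert(z0, toy_eps, alphas)
    z0_hat = ddim_reconstruct(zT, toy_eps, alphas)
    print(float(np.abs(z0 - z0_hat).max()))
```

With the constant toy predictor the round trip is exact up to floating-point error. With a learned predictor the noise estimates differ between the two directions, so reconstruction is only approximate; that gap is presumably the kind of inversion-induced discrepancy the video composite is meant to capture.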

Paper Structure

This paper contains 55 sections, 11 equations, 22 figures, 13 tables, 1 algorithm.

Figures (22)

  • Figure 1: Temporally averaged cross-attention maps. For each video, we extract audio–visual cross-attention during DDIM inversion and average the maps over all frames to obtain a single heatmap. Real vs. fake samples exhibit consistent disparities.
  • Figure 2: Input representations with complementary features. (a) video composite $\boldsymbol{\phi}$ is obtained from video $x$ and audio $c$ by running DDIM inversion and reconstruction, decoding both the noisy and clean latents, and computing the residual. We then concatenate four components channel-wise: $x$, $D(\hat{z}_T)$, $D(\hat{z}_0)$, and $\lvert x - D(\hat{z}_0)\rvert$. (b) AV cross-attention feature $\boldsymbol{\psi}$ is extracted during DDIM inversion from the diffusion U-Net and summarized as a frame-aligned tensor. These complementary cues (a) and (b) capture appearance information and modality alignment, respectively. For clarity, all visual elements shown ($D(\hat{z}_T)$, $D(\hat{z}_0)$, and $\lvert x - D(\hat{z}_0)\rvert$) are decoded images.
  • Figure 3: The overall framework of X-AVDT. From each audio-visual pair, we form two inputs $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$. Two 3D encoders map them to features that are concatenated and passed through the Feature Fusion Decoder to produce a fused feature. A classification head outputs the real/fake score, while an embedding head is trained with a triplet objective to improve robustness.
  • Figure 4: Comparison of attention features across diffusion timesteps. The red box denotes the configuration used in our method.
  • Figure 5: Robustness against unseen corruptions. AUROC (%) across five severity levels. Per-corruption scales are shown on the $x$-axes. Average denotes the mean AUROC across all corruptions at each severity level.
  • ...and 17 more figures
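
The captions for Figures 1 and 2 describe two simple tensor operations: the channel-wise concatenation that forms the video composite $\boldsymbol{\phi}$, and the temporal averaging that turns per-frame cross-attention maps into one heatmap. A minimal NumPy sketch of both follows. The function names, the `(T, C, H, W)` frame layout, and the `(T, H, W)` attention layout are assumptions made for illustration; the decoded images $D(\hat{z}_T)$ and $D(\hat{z}_0)$ are taken as given inputs rather than computed.

```python
import numpy as np


def video_composite(x, dec_zT, dec_z0):
    """Concatenate x, D(z_T), D(z_0), and |x - D(z_0)| along the channel axis.

    All inputs are assumed to share the shape (T, C, H, W); the result has
    shape (T, 4 * C, H, W). The layout is an assumption, not from the paper.
    """
    residual = np.abs(x - dec_z0)  # per-pixel reconstruction residual
    return np.concatenate([x, dec_zT, dec_z0, residual], axis=1)


def temporal_average_attention(attn):
    """Average per-frame attention maps of shape (T, H, W) into one (H, W) heatmap."""
    return attn.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C, H, W = 8, 3, 16, 16                 # toy sizes for the demo
    x = rng.random((T, C, H, W))              # stand-in for input frames
    dec_zT = rng.random((T, C, H, W))         # stand-in for D(z_T)
    dec_z0 = rng.random((T, C, H, W))         # stand-in for D(z_0)
    phi = video_composite(x, dec_zT, dec_z0)
    heatmap = temporal_average_attention(rng.random((T, H, W)))
    print(phi.shape, heatmap.shape)
```

Keeping the residual as an explicit fourth component means a downstream encoder does not have to learn the subtraction itself. The feature $\boldsymbol{\psi}$ used by the detector is described as a frame-aligned tensor, so the temporal average here corresponds only to the visualization in Figure 1.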