Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Jielun Peng; Yabin Wang; Yaqi Li; Long Kong; Xiaopeng Hong

Abstract

The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely on either uni-modal artifacts or audio-visual discrepancies, and so fail to leverage both sources of information jointly. Moreover, detectors that rely on generator-specific artifacts tend to generalize poorly to unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence, both within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence and of inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC then performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC in the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.

Paper Structure (30 sections, 7 equations, 7 figures, 17 tables)

Introduction
Related Works
Audio-Visual Self-Supervised Learning
Deepfake Detection
Method
Overview
Holistic Coherence Priors Pre-training
Holistic Adaptive Aggregation Classification
HiFi-AVDF Dataset
Experiments
Experimental Setup
Main Results
Ablation Studies
Conclusion
Overview
...and 15 more sections

Figures (7)

Figure 1: Overview of our proposed HAVIC. (a) Holistic Coherence Priors Pre-training phase. The visible tokens $x_{v,vis}$ and $x_{a,vis}$ are encoded by $E_v$ and $E_a$ to produce hierarchical features $\{\bm{v}_l\}_{l=1}^H$ and $\{\bm{a}_l\}_{l=1}^H$. The high-level representations $\bm{v}_H$ and $\bm{a}_H$ are further processed by the Audio-Visual Interaction Module $\mathcal{I}$ to yield interaction-aware features $\bm{v}_{inter}$ and $\bm{a}_{inter}$. These features are decoded by modality-specific decoders $D_v$ and $D_a$, where each layer integrates hierarchical features for input reconstruction. In parallel, $\bm{v}_{inter}$ and $\bm{a}_{inter}$ are fed into cross-modal decoders $D_{v\to a}$ and $D_{a\to v}$ to reconstruct the counterpart semantics $\hat{\bm{a}}_{H_g}$ and $\hat{\bm{v}}_{H_g}$, supervised by gradient-stopped targets $\bm{a}_{H_g}$ and $\bm{v}_{H_g}$. (b) Holistic Adaptive Aggregation Classification phase. The pre-trained $E_v$, $E_a$, and $\mathcal{I}$ with learned holistic coherence priors are used to extract features from the input sample. These features are aggregated by the Adaptive Feature Aggregation module $\mathcal{A}$ and then fed into the classifiers to predict both modality-specific and overall authenticity.
Figure 1: Examples of real–fake video pairs generated by the six models in HiFi-AVDF. For each model, we display one representative pair, where the top row shows a real video clip and the bottom row shows the corresponding forged clip produced by that model.
Figure 2: Illustrations of three self-supervised objectives in the Holistic Coherence Priors Pre-training phase. (a) Each decoder layer reconstructs inputs using hierarchical encoder features, enforcing modality-specific structural coherence (illustrated with the visual modality). (b) Audio and visual features are temporally partitioned and aligned segment by segment, with a soft negative pairs strategy to capture inter-modal micro-coherence (only one temporal segment and the visual-to-audio direction are illustrated for simplicity). (c) One modality reconstructs the global semantic representation of the other, ensuring inter-modal macro-coherence.
Figure 2: Modality-Specific Hierarchical Reconstruction visualizations. For each clip, the first row shows the original audio spectrograms and visual frames, while the second and third rows depict the masked inputs and the corresponding reconstructions from the final decoder layer, respectively. Details of the reconstructions can be seen by zooming in.
Figure 3: Overview of the HiFi-AVDF dataset. The left panel shows the deepfake generation process, where audio tracks, reference frames, and video captions of real videos are fed into generation models. The right panel shows dataset statistics, detailing the proportion of samples generated by each generative model and the distribution of different manipulation strategies. The bottom panels present representative examples generated by different models.
...and 2 more figures
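
The segment-by-segment alignment described in the caption of Figure 2(b) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `segment_infonce`, the mean pooling per segment, the temperature, and the `neg_weight` matrix are all assumptions made for illustration. In particular, `neg_weight` is only a hypothetical stand-in for the soft negative pairs strategy, whose actual form is not given in this summary. Only the visual-to-audio direction is shown, as in the figure.

```python
import numpy as np


def segment_infonce(v, a, num_segments=4, temperature=0.1, neg_weight=None):
    """Visual-to-audio InfoNCE loss, averaged over temporal segments.

    v, a: arrays of shape (batch, time, dim); time must divide evenly.
    neg_weight: optional (batch, batch) array in [0, 1] that scales the
        contribution of each negative pair (hypothetical soft-negative
        placeholder; the diagonal, i.e. the positives, is always 1).
    """
    b, t, d = v.shape
    assert a.shape == v.shape and t % num_segments == 0
    seg_len = t // num_segments

    # Partition the time axis into segments and mean-pool each segment.
    v_seg = v.reshape(b, num_segments, seg_len, d).mean(axis=2)
    a_seg = a.reshape(b, num_segments, seg_len, d).mean(axis=2)

    # L2-normalise so the dot product is a cosine similarity.
    v_seg = v_seg / np.linalg.norm(v_seg, axis=-1, keepdims=True)
    a_seg = a_seg / np.linalg.norm(a_seg, axis=-1, keepdims=True)

    if neg_weight is None:
        w = np.ones((b, b))
    else:
        w = np.asarray(neg_weight, dtype=float).copy()
    np.fill_diagonal(w, 1.0)  # positives keep full weight

    losses = []
    for s in range(num_segments):
        # Similarity of every visual segment to every audio segment at index s.
        logits = v_seg[:, s] @ a_seg[:, s].T / temperature
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        weighted = np.exp(logits) * w
        losses.append(-np.log(np.diag(weighted) / weighted.sum(axis=1)))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 8, 16))
    print("aligned :", segment_infonce(video, video.copy()))
    print("shuffled:", segment_infonce(video, np.roll(video, 1, axis=0)))
```

With matching audio and visual features the loss is close to zero, while pairing each clip with another clip's audio raises it, which is the behaviour a coherence prior learned on authentic videos would rely on.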
