AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Trevine Oorloff; Surya Koppisetti; Nicolò Bonettini; Divyaraj Solanki; Ben Colman; Yaser Yacoob; Ali Shahriyari; Gaurav Bharaj

TL;DR

The paper tackles the problem of robust video deepfake detection by leveraging audio-visual coherence rather than relying solely on visual cues or corpus-specific patterns. It proposes AVFF, a two-stage framework that first learns cross-modal representations from real videos via self-supervised contrastive and autoencoding objectives with a novel complementary masking and cross-modal fusion strategy, then performs supervised deepfake classification using those representations. Empirical results show AVFF achieves state-of-the-art performance on FakeAVCeleb (98.6% accuracy and 99.1% AUC) and strong generalization across unseen manipulations and datasets, illustrating the value of explicit audio-visual correspondence learning. The work highlights the practical significance of multi-modal alignment for defense against evolving deepfake threats and points to future work on preserving uni-modal cues and extending to broader AV tasks.

Abstract

With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
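The two self-supervised objectives named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes an InfoNCE-style contrastive term between paired audio and visual embeddings and a mean-squared-error reconstruction term restricted to masked positions. The function names, the temperature, and the weights `lam_c` and `lam_ae` are hypothetical placeholders, and embeddings are plain Python lists rather than tensors.

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _normalize(u):
    norm = math.sqrt(_dot(u, u))
    return [x / norm for x in u]


def _cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form): row i of each batch is a pair."""
    a = [_normalize(u) for u in audio_emb]
    v = [_normalize(u) for u in visual_emb]
    logits = [[_dot(ai, vj) / temperature for vj in v] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_on_diagonal(logits)
                  + _cross_entropy_on_diagonal(transposed))


def masked_reconstruction_loss(target, recon, mask):
    """Mean squared error over masked positions only (mask value 1)."""
    errors = [(t - r) ** 2 for t, r, m in zip(target, recon, mask) if m == 1]
    return sum(errors) / len(errors)


def dual_objective(audio_emb, visual_emb, target, recon, mask,
                   lam_c=0.01, lam_ae=1.0):
    """Weighted sum of both terms; the weights are illustrative assumptions."""
    return (lam_c * contrastive_loss(audio_emb, visual_emb)
            + lam_ae * masked_reconstruction_loss(target, recon, mask))


# Toy check: matched pairs should score lower than mismatched pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing gives a much smaller contrastive loss than the swapped one, which is the behavior the representation learning stage relies on when it is trained on real videos only.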

Paper Structure (23 sections, 5 equations, 7 figures, 7 tables)

Introduction
Related Works
Multi-Modal Representation Learning
Deepfake Detection
Method
Preprocessing
Representation Learning Stage
Deepfake Classification Stage
Experiments and Results
Implementation
Evaluation and Discussion
Ablation Study
Conclusion, Limitations, and Future Work
Overview
Implementation Details
...and 8 more sections

Figures (7)

Figure 1: We use audio-visual correspondences for deepfake detection. Transformer-based encoders are used to extract audio and visual feature tokens, which are then masked complementarily. The visible audio tokens are sent through a learnable A2V network to predict the masked visual tokens. These predicted visual tokens are fused with the visible visual tokens to obtain the full visual embeddings. Full audio embeddings are obtained in a similar way using the V2A network. The audio/visual embeddings are then used for video reconstruction in the MAE sense, and subsequently for deepfake classification.
Figure 2: Audio-Visual Representation Learning Stage. A real input sample, $x \in \mathcal{D}_r$, with corresponding audio and visual tokens ($\bm{x_a}$, $\bm{x_v}$), is split along the temporal dimension, creating $K$ slices, $\{x_{a,t_i}\}_{i=1}^K$ and $\{x_{v,t_i}\}_{i=1}^K$ (illustrated with $K=8$ in the figure). The temporal slices are then encoded using unimodal transformers, $E_a$ and $E_v$, to yield feature embeddings $\bm{a}$ and $\bm{v}$. We then complementarily mask $50\%$ of the temporal slices in ($\bm{a}$, $\bm{v}$) with binary masks ($\bm{M}_a$, $\bm{M}_v$). The visible slices of $\bm{a}$ and $\bm{v}$ are passed through A2V and V2A networks respectively, to generate cross-modal slices $\bm{a_v}$ and $\bm{v_a}$. The masked slices of $\bm{a}$ and $\bm{v}$ are then replaced with the corresponding slices in $\bm{a_v}$ and $\bm{v_a}$. The resulting cross-modal fusion representations, $\bm{a'}$ and $\bm{v'}$, are input to unimodal decoders to obtain the audio and visual reconstructions, $\bm{\hat{x}_a}$ and $\bm{\hat{x}_v}$. For the learning, we use a dual-objective loss function, which computes the contrastive loss between the audio and visual feature embeddings and the autoencoder loss between the input and the reconstruction of the masked tokens.
Figure 3: Deepfake Classification Stage. Given a sample $x\in \mathcal{D}_{df}$, comprising audio and visual inputs $\bm{x_a}$ and $\bm{x_v}$, we obtain the unimodal features ($\bm{a}, \bm{v}$) and the cross-modal embeddings ($\bm{a_v}, \bm{v_a}$). For each modality, the unimodal and cross-modal embeddings are concatenated to obtain ($\bm{f_a}, \bm{f_v}$). A classifier network is then trained to take ($\bm{f_a}, \bm{f_v}$) as input and predict whether the input is real or fake.
Figure 4: The t-SNE Visualization of the Embeddings at the end of the Representation Learning Stage. A clear distinction is seen between the representations of real and fake videos, as well as between different deepfake categories. Further analysis indicates that samples of adjacent clusters are generated using the same deepfake algorithm, which we encircle manually to highlight the clusters.
Figure 5: Robustness to Unseen Visual Perturbations. We illustrate AUC scores (%) as a function of different levels of intensities for various visual perturbations evaluated on the test set of FakeAVCeleb. Our model is more robust than RealForensics (Haliassos et al., 2022), which is the current state-of-the-art in robustness to unseen visual perturbations.
...and 2 more figures
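
The complementary masking and cross-modal fusion described in the captions of Figures 1 and 2 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: slices are represented as strings, the helper names are hypothetical, and `a2v` / `v2a` are stand-ins that act on one slice at a time, whereas the A2V and V2A networks in the paper are learned modules operating on the visible tokens.

```python
import random


def complementary_masks(num_slices, seed=0):
    """Return binary masks (m_a, m_v); 1 marks a masked temporal slice.

    Half of the slices are masked in audio and the visual mask is the
    complement, so every slice is visible in exactly one modality.
    """
    if num_slices % 2 != 0:
        raise ValueError("num_slices must be even for a 50% mask")
    rng = random.Random(seed)
    masked_audio = set(rng.sample(range(num_slices), num_slices // 2))
    m_a = [1 if i in masked_audio else 0 for i in range(num_slices)]
    m_v = [1 - m for m in m_a]
    return m_a, m_v


def cross_modal_predictions(source, source_mask, net):
    """Apply `net` to visible source slices only; masked positions yield None."""
    return [net(s) if m == 0 else None for s, m in zip(source, source_mask)]


def fuse(own, cross, own_mask):
    """Keep visible slices of `own`; fill its masked slices from `cross`."""
    return [c if m == 1 else o for o, c, m in zip(own, cross, own_mask)]


def a2v(audio_slice):
    """Hypothetical stand-in for the learned audio-to-visual network."""
    return f"A2V({audio_slice})"


def v2a(visual_slice):
    """Hypothetical stand-in for the learned visual-to-audio network."""
    return f"V2A({visual_slice})"


K = 8  # number of temporal slices, as in the Figure 2 illustration
audio = [f"a{i}" for i in range(K)]
visual = [f"v{i}" for i in range(K)]

m_a, m_v = complementary_masks(K, seed=0)

# Visible audio slices predict the masked visual slices, and vice versa.
pred_visual = cross_modal_predictions(audio, m_a, a2v)
pred_audio = cross_modal_predictions(visual, m_v, v2a)

# Fused sequences: own visible slices plus cross-modal predictions elsewhere.
audio_fused = fuse(audio, pred_audio, m_a)
visual_fused = fuse(visual, pred_visual, m_v)

print(audio_fused)
print(visual_fused)
```

Because the masks are complementary, every masked slice in one modality has a visible counterpart in the other, so the fused sequences never contain an unfilled position. In the paper these fused representations are then passed to the unimodal decoders for reconstruction.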
