Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach

Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang, Bin Li

TL;DR

This work tackles the generalization gap in audio-visual deepfake detection by freezing pre-trained backbones and injecting Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). It introduces two core modules: Global-Local Forgery-aware Adaptation (GLFA) to capture high-frequency intra-modal forgery traces, and Variational Bayesian Forgery Estimation (VBFE) to model audio-visual correlations as Gaussian latent variables and learn them via variational Bayes. The latent space is factorized into modality-specific and correlation-specific components with an orthogonality constraint, optimized through an augmented ELBO that incorporates a dynamic prior $f_K$ and Jensen-Shannon divergence. Extensive experiments across FakeAVCeleb, KoDF, DeAVMiT, and DFDC show improved generalization and robustness, even with limited training data, highlighting FoVB’s practical potential for real-world deepfake forensics.
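As a rough illustration of the quantities named above, the following NumPy sketch shows a reparameterized Gaussian sample, the closed-form KL divergence between two diagonal Gaussians (the usual ELBO regularizer), and one possible orthogonality penalty between two latent factors. It is not the authors' implementation: all function names are hypothetical, the plain KL term stands in for the Jensen-Shannon-augmented objective described here, and the dynamic prior $f_K$ is not modeled.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis.

    Assumption: a standard KL regularizer, used here as a stand-in for the
    JS-divergence term mentioned in the text.
    """
    var_q = np.exp(log_var_q)
    var_p = np.exp(log_var_p)
    kl = 0.5 * (log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

def orthogonality_penalty(z_specific, z_correlation):
    """Squared Frobenius norm of the cross-Gram matrix of two latent batches.

    Inputs have shape (batch, dim). The penalty is zero when every
    modality-specific dimension is uncorrelated (over the batch) with every
    correlation-specific dimension. This is one common choice, assumed here.
    """
    cross = z_specific.T @ z_correlation
    return float(np.sum(cross ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_q, log_var_q = rng.normal(size=(8, 16)), np.zeros((8, 16))
    mu_p, log_var_p = np.zeros((8, 16)), np.zeros((8, 16))
    z = reparameterize(mu_q, log_var_q, rng)
    print("KL per sample:", kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p))
    print("orthogonality penalty:", orthogonality_penalty(z[:, :8], z[:, 8:]))
```

In a training loop, terms like these would typically be added to the classification loss with tunable weights; the weighting used by FoVB is not stated in this summary.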

Abstract

The widespread application of AIGC content has brought not only unprecedented opportunities, but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate correlation learning as variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones under an orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods on various benchmarks.
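To make the first design more concrete, here is a minimal NumPy sketch of two generic building blocks of that kind: a central difference convolution (one common variant of difference convolution) and an FFT-based high-pass filter. Which difference convolutions and which filter FoVB actually uses is not specified in the abstract, so the variant, the function names, and the `cutoff` parameter are all assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 2D cross-correlation of image x with kernel w, no padding."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def central_difference_conv2d(x, w):
    """Central difference convolution for an odd-sized kernel.

    Computes sum_n w_n * (x_n - x_center) per window, which equals the
    vanilla response minus x_center * sum(w). Flat regions give zero output,
    so the operator emphasizes local intensity changes.
    """
    kh, kw = w.shape
    center = x[kh // 2 : x.shape[0] - kh // 2, kw // 2 : x.shape[1] - kw // 2]
    return conv2d_valid(x, w) - center * w.sum()

def fft_high_pass(x, cutoff):
    """Zero all frequency bins within `cutoff` of DC, then invert the FFT."""
    spec = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spec[dist <= cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(16, 16))   # stand-in for one feature map
    kernel = rng.normal(size=(3, 3))    # stand-in for learned weights
    local = central_difference_conv2d(image, kernel)   # local traces
    global_hf = fft_high_pass(image, cutoff=2.0)       # global high-frequency part
    print(local.shape, global_hf.shape)
```

The difference operator acts on small neighborhoods while the frequency-domain filter acts on the whole map, which is one plausible reading of the "local and global" distinction drawn in the abstract.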

Paper Structure

This paper contains 44 sections, 15 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Difference between the proposed FoVB and previous methods. Previous methods leverage massive samples to fully fine-tune the pre-trained backbones, e.g., ViT, ResNet, etc., and capture forgery artifacts whose latent variables follow an unknown distribution. In contrast, with frozen pre-trained backbones, our FoVB adopts two core designs, i.e., Global-Local Forgery-aware Adaptation (GLFA) and Variational Bayesian Forgery Estimation (VBFE), to capture intra-modal (audio or visual) and cross-modal (audio-visual) forgery traces characterized by Gaussian-distributed latent variables, and thus learns a more generalizable forgery representation.
  • Figure 2: Overview of the proposed FoVB framework. The audio-visual sequences are fed into pre-trained backbones, where the GLFA extracts intra-modal forgery features, and the VBFE exploits the extracted features by the $i$-th transformer block to estimate audio-visual latent variables with variational Bayes. Then, the estimated variables facilitate the adaptation of forgery-relevant knowledge. Finally, we leverage the classification heads of audio-visual sequences, i.e., $\mathrm{head}_{v}$ and $\mathrm{head}_{a}$, to determine if forgery exists.
  • Figure 3: Implementation of prior and posterior encoders in VBFE. $x_{a}, x_{v}$ are fed into the encoders $\theta, \phi$ to estimate the mean and variance values $\mu, \delta$ of latent variables. (a) Architecture of the prior encoder $\theta$. (b) Architecture of the posterior encoder $\phi$, which incorporates label embedding with cross-attention operations to estimate the posterior distribution. (c) Pipeline of variable adaptation.
  • Figure 4: Robustness to Unseen Perturbations. Following the setting of AVFF, we report AUC scores (%) for various perturbations with different intensities evaluated on the test set of FakeAVCeleb. Specifically, the first seven perturbations attack visual contents, while the remaining ones are audio perturbations.
  • Figure 5: Analysis of the Factorized Variables. In the first row, we visualize the mean values of audio-visual samples via t-SNE projection, grouped by forgery category; we also present the average variance values in each feature dimension.
  • ...and 3 more figures