Statistics-aware Audio-visual Deepfake Detector

Marcella Astrid, Enjie Ghorbel, Djamila Aouada

TL;DR

This work introduces SADD, a shallow audio-visual (AV) deepfake detector that takes raw waveform audio as input, applies a post-processing normalization to the fakeness score, and is trained with a statistics-aware loss that separates the feature distributions of real and fake audio-visual data. By combining an Enhanced Modality Dissonance Score (MDS) with a first-order statistics distance, the method achieves competitive or superior AUC on the DFDC benchmark while reducing model complexity. Ablation studies confirm that each component (waveform input, normalization, and the statistics-aware loss) contributes to the performance gains. Cross-dataset testing on FakeAVCeleb suggests improved generalization, though the method still falls short of the strongest state-of-the-art methods. Overall, SADD offers a more efficient and robust approach to AV deepfake detection, with practical implications for real-time or resource-constrained settings.

Abstract

In this paper, we propose an enhanced audio-visual deepfake detection method. Recent methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features. Although they have shown promising results, they are based on the maximization/minimization of isolated feature distances without considering feature statistics. Moreover, they rely on cumbersome deep learning architectures and are heavily dependent on empirically fixed hyperparameters. Herein, to overcome these limitations, we propose: (1) a statistical feature loss to enhance the discrimination capability of the model, instead of relying solely on feature distances; (2) using the waveform to describe the audio in place of frequency-based representations; (3) a post-processing normalization of the fakeness score; (4) the use of a shallower network to reduce the computational complexity. Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.

Paper Structure (20 sections, 11 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Histograms of features extracted from three real and three fake samples using an enhanced version of MDS. Each column shows the feature values over a different range.
  • Figure 2: Our method consists of visual and audio feature extractors that extract features $\mathbf{f}^v$ and $\mathbf{f}^a$ from image sequences $\mathbf{I}^v$ and audio waveforms $\mathbf{I}^a$, respectively. Two separate classification layers are integrated on top of each extractor. The model is trained using a cross-entropy loss for each network, along with a feature distance loss. To enhance feature discrimination between real and fake data, we introduce an additional feature statistics-aware loss.
  • Figure 3: Feature distribution histograms of real and fake data with different values of $\alpha$. Adding the statistics-aware loss ($\alpha > 0$) results in distinguishable distribution characteristics between real and fake data.
  • Figure 4: Phenomena observed in Fig. \ref{fig:feat_vis_baseline_3sample} are also evident in other models without statistics-aware loss: (a) Deep network with mel-spectrogram audio input (Table \ref{tab:baseline_comparisons}(b)); (b) Deep network with waveform audio input (Table \ref{tab:baseline_comparisons}(c)); (c) Shallow network with waveform audio input (Table \ref{tab:baseline_comparisons}(d)) trained with the smaller set.
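
The captions of Figures 2 and 3 describe a feature distance loss combined with a statistics-aware loss whose contribution is controlled by $\alpha$. The sketch below is only one possible reading of that idea, not the paper's formulation: the contrastive form of both terms, the `margin` parameter, the use of the per-sample feature mean as the "first-order statistic", and all function names are assumptions made for illustration. The two per-network cross-entropy terms mentioned in the Figure 2 caption are omitted.

```python
import numpy as np


def distance_loss(f_v, f_a, is_fake, margin=1.0):
    """Contrastive-style loss on the audio-visual feature distance (assumed form).

    f_v, f_a: arrays of shape (batch, dim); is_fake: array of 0/1 labels.
    Real pairs are pulled together; fake pairs are pushed at least `margin` apart.
    """
    d = np.linalg.norm(f_v - f_a, axis=1)
    real_term = (1.0 - is_fake) * d ** 2
    fake_term = is_fake * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(real_term + fake_term))


def statistics_loss(f_v, f_a, is_fake, margin=1.0):
    """Hypothetical first-order statistics term.

    Compares the mean of each sample's visual features with the mean of its
    audio features, instead of comparing the features element by element.
    """
    gap = np.abs(f_v.mean(axis=1) - f_a.mean(axis=1))
    real_term = (1.0 - is_fake) * gap ** 2
    fake_term = is_fake * np.maximum(margin - gap, 0.0) ** 2
    return float(np.mean(real_term + fake_term))


def combined_loss(f_v, f_a, is_fake, alpha=0.5, margin=1.0):
    """Distance loss plus alpha times the statistics term (alpha = 0 disables it)."""
    return (distance_loss(f_v, f_a, is_fake, margin)
            + alpha * statistics_loss(f_v, f_a, is_fake, margin))
```

With `alpha=0.0` the sketch reduces to the distance term alone, which mirrors the $\alpha = 0$ baseline shown in Figure 3. Any `alpha > 0` adds pressure on the feature means of real and fake samples to differ.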