Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning

Stefan Smeu, Dragos-Alexandru Boldisor, Dan Oneata, Elisabeta Oneata

TL;DR

The paper reveals a leading-silence bias in two widely used audio-visual deepfake datasets that can be exploited by supervised models, inflating perceived performance. It proposes AVH-Align, an unsupervised method that aligns AV-HuBERT audio and visual representations via a learnable alignment network trained only on real data, mitigating dataset-specific shortcuts. The results show that the bias can yield high AUCs even after trimming, while AVH-Align achieves robust detection (e.g., 85.24% AUC on the AV-Deepfake1M test set) without using fake data and outperforms other unsupervised baselines, highlighting the importance of bias-aware evaluation. The work argues for dataset design scrutiny and promotes a practical real-data, self-supervised framework to enhance generalization across manipulation techniques.

Abstract

Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto such unwanted artifacts, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
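
To make the leading-silence shortcut concrete, the following is a minimal sketch of how such a feature could be measured on a raw waveform. It is not the authors' implementation: the function names, the amplitude threshold of `1e-4`, and the 30 ms window are assumptions chosen for illustration only.

```python
def leading_silence_ms(waveform, sample_rate, threshold=1e-4):
    """Duration (in ms) of the initial run of samples with |x| <= threshold.

    `threshold` is a hypothetical silence threshold, not a value from the paper.
    """
    n_silent = 0
    for x in waveform:
        if abs(x) > threshold:
            break
        n_silent += 1
    return 1000.0 * n_silent / sample_rate


def max_leading_amplitude(waveform, sample_rate, duration_ms=30.0):
    """Largest absolute amplitude within the first `duration_ms` of audio.

    `duration_ms` is a hypothetical window length used only for illustration.
    """
    n = int(sample_rate * duration_ms / 1000.0)
    return max((abs(x) for x in waveform[:n]), default=0.0)


# Toy example: 400 silent samples followed by speech-like samples at 16 kHz.
toy = [0.0] * 400 + [0.5] * 100
silence = leading_silence_ms(toy, sample_rate=16000)
```

Either scalar could then be thresholded to produce a real/fake decision, which is what makes the bias so easy for a supervised model to pick up.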

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Audio-visual deepfake detection datasets have a silence bias: fake samples start with a brief moment of silence, which is not the case for real samples. Here we show the first 62.5 ms of the audio waveform for a real and the corresponding fake sample from the AV-Deepfake1M dataset (Cai et al., 2024).
  • Figure 2: Normalized distribution plots of the leading silence duration for real and fake videos in the FakeAVCeleb (left) and AV-Deepfake1M (right) datasets. The fake samples start with 25 to 30 ms of silence.
  • Figure 3: Left: The impact of the silence threshold on the leading silence classifier. Right: The impact of the leading duration on the maximum amplitude classifier.
  • Figure 4: Overview of the AVH-Align method. A: We use the pretrained AV-HuBERT model to extract self-supervised features, which we further align with a learnable network $\Phi$. Note that we use a single AV-HuBERT model, but make two forward passes to obtain audio-only and video-only features (instead of a single set of multimodal features). B: During training we maximize the alignment score $\Phi_{ii}$ between the audio features $\mathbf{a}_i$ at time step $i$ and the corresponding video features $\mathbf{v}_i$, while minimizing the alignment $\Phi_{ik}$ to the other features $\mathbf{v}_k$ in a neighboring window $\mathcal{N}(i)$.
  • Figure 5: Per-frame fakeness probabilities for AVH-Align and AVH-Align/sup on AV-Deepfake1M. AVH-Align/sup always marks the first frame, which usually corresponds to the leading silence, as fake, thus confirming that it uses the bias to distinguish between real and fake videos. AVH-Align is not affected by the presence of the leading silence. The fakeness probabilities for AVH-Align can be interpreted as misalignment probabilities, which is why they are higher during or after the manipulated region.
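
The training objective described in the Figure 4 caption can be illustrated with a small sketch. It assumes a softmax (cross-entropy) form over the window $\mathcal{N}(i)$, which is one common way to maximize $\Phi_{ii}$ while minimizing $\Phi_{ik}$; the exact loss used in the paper is not given in this summary. The dot-product `dot_score` is a hypothetical stand-in for the learnable network $\Phi$, and `radius` is a hypothetical window half-width.

```python
import math


def dot_score(a, v):
    """Hypothetical stand-in for the learnable alignment network Phi."""
    return sum(x * y for x, y in zip(a, v))


def windowed_alignment_loss(audio, video, score=dot_score, radius=2):
    """Mean over time steps i of -log softmax(Phi_ii) within the window N(i).

    audio, video: lists of per-frame feature vectors of equal length T.
    Assumed loss form; N(i) is taken as frames within `radius` of i.
    """
    T = len(audio)
    total = 0.0
    for i in range(T):
        lo, hi = max(0, i - radius), min(T, i + radius + 1)
        logits = [score(audio[i], video[k]) for k in range(lo, hi)]
        # Numerically stable log-sum-exp over the window.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - score(audio[i], video[i])
    return total / T


# Toy features: frame i is a scaled one-hot vector in both modalities.
video = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
audio = [[5.0 * x for x in v] for v in video]
aligned = windowed_alignment_loss(audio, video)
misaligned = windowed_alignment_loss(audio, video[::-1])
```

Under this assumed form, the per-frame term is low when audio and video agree and high when they do not, which matches the reading of the AVH-Align scores in Figure 5 as misalignment probabilities.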