Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies

Marcella Astrid, Enjie Ghorbel, Djamila Aouada

TL;DR

This work introduces a local audio-visual model that captures small spatial regions prone to inconsistencies with the audio, together with a temporally-local pseudo-fake augmentation that adds samples with subtle temporal inconsistencies to the training set.

Abstract

Existing methods on audio-visual deepfake detection mainly focus on high-level features for modeling inconsistencies between audio and visual data. As a result, these approaches usually overlook finer audio-visual artifacts, which are inherent to deepfakes. Herein, we propose the introduction of fine-grained mechanisms for detecting subtle artifacts in both spatial and temporal domains. First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio. For that purpose, a fine-grained mechanism based on a spatially-local distance coupled with an attention module is adopted. Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set. Experiments on the DFDC and the FakeAVCeleb datasets demonstrate the superiority of the proposed method in terms of generalization as compared to the state-of-the-art under both in-dataset and cross-dataset settings.
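
The temporally-local pseudo-fake augmentation described above can be illustrated with a minimal sketch: a short segment of one sequence is overwritten with a segment taken from another sequence. The function name `make_temporal_pseudo_fake`, the fixed `segment_len`, the uniformly random positions, and the toy arrays are all assumptions made for illustration; the abstract does not specify how segment length or location are chosen.

```python
import numpy as np

def make_temporal_pseudo_fake(target, donor, segment_len, rng):
    """Replace one short segment of `target` with a segment from `donor`.

    Both inputs are arrays whose first axis is time (video frames or audio
    chunks). Returns the modified copy and the (start, end) of the replaced
    span. Hypothetical helper, not the authors' implementation.
    """
    if segment_len <= 0 or segment_len > min(len(target), len(donor)):
        raise ValueError("segment_len must fit inside both sequences")
    # Assumption: paste and cut positions are drawn uniformly at random.
    start = int(rng.integers(0, len(target) - segment_len + 1))
    donor_start = int(rng.integers(0, len(donor) - segment_len + 1))
    fake = target.copy()  # leave the original sample untouched
    fake[start:start + segment_len] = donor[donor_start:donor_start + segment_len]
    return fake, (start, start + segment_len)

rng = np.random.default_rng(0)
# Toy stand-ins: 20 "frames" of 4x4 pixels; the real clip is all zeros,
# the donor clip is all ones, so the replaced span is easy to spot.
real_video = np.zeros((20, 4, 4))
donor_video = np.ones((20, 4, 4))
fake_video, (s, e) = make_temporal_pseudo_fake(real_video, donor_video, 3, rng)
```

Applying the same function to the audio stream only, the visual stream only, or both streams would yield the three pseudo-fake variants named in the Figure 4 caption.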

Paper Structure (22 sections, 6 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: (a) Previous works (chugh2020not; gu2021deepfake) utilize high-level global features to measure inconsistencies between audio and visual data. (b) The proposed method measures the inconsistency between different visual regions and the audio input.
  • Figure 1: Ablation study of our work, reported in terms of AUC on the DFDC and FakeAVCeleb (FAV) datasets. "Att.", "PF", and "RC" denote Attention, Pseudo-Fakes, and Residual Connections, respectively. The results produced by our method are reported in (d). The best and second-best performances are marked in bold and underlined, respectively.
  • Figure 2: (a) The proposed temporally-local pseudo-fake synthesis involves the replacement of a small video segment by a subsequence extracted from another video (marked in blue). (b) The same strategy is followed for audio data.
  • Figure 3: The proposed spatially-local deepfake detector: First, audio and visual features are extracted separately. Next, we compute the distance and attention maps between the audio and all spatial positions of the visual features. Finally, the distance map and the attention map are multiplied and fed into a single-layer real/fake classifier.
  • Figure 4: Our temporally-local pseudo-fake data synthesis: Given the original dataset illustrated in (a), we can create three types of pseudo-fakes: modifying only the audio data, modifying only the visual data, or modifying both the audio and visual inputs, as illustrated in (b).
  • ...and 10 more figures
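
The pipeline in the Figure 3 caption (distance map, attention map, element-wise product, single-layer classifier) can be sketched in a few lines of NumPy. This is only a rough illustration under stated assumptions: the Euclidean distance, the bilinear softmax attention, the sigmoid output, and the names `local_av_score`, `w_att`, `w_cls`, and `b_cls` are hypothetical choices, and the random weights stand in for learned parameters. The caption does not give the actual distance or attention formulation.

```python
import numpy as np

def local_av_score(visual, audio, w_att, w_cls, b_cls):
    """Score one clip from a visual feature map and an audio feature vector.

    visual: (C, H, W) feature map; audio: (C,) feature vector.
    Returns (probability of fake, distance map, attention map).
    """
    C, H, W = visual.shape
    v = visual.reshape(C, H * W).T                 # (H*W, C): one row per position
    # Distance map (assumed Euclidean): audio vs. every spatial position.
    dist = np.linalg.norm(v - audio[None, :], axis=1)
    # Attention map (assumed bilinear score, softmax over positions).
    logits = v @ w_att @ audio
    logits = logits - logits.max()                 # numerical stability
    att = np.exp(logits) / np.exp(logits).sum()
    # Multiply the two maps, then apply a single linear layer plus sigmoid.
    weighted = dist * att
    z = weighted @ w_cls + b_cls
    prob_fake = 1.0 / (1.0 + np.exp(-z))
    return prob_fake, dist.reshape(H, W), att.reshape(H, W)

rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 7, 7))   # toy features: 8 channels on a 7x7 grid
audio = rng.normal(size=8)
w_att = 0.1 * rng.normal(size=(8, 8)) # placeholder for learned attention weights
w_cls = 0.1 * rng.normal(size=49)     # placeholder for learned classifier weights
prob, dist_map, att_map = local_av_score(visual, audio, w_att, w_cls, 0.0)
```

Keeping the distance per spatial position, rather than pooling the visual features first, is what lets the classifier respond to a mismatch confined to a small region.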