AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang

TL;DR

This work addresses the challenge of detecting audio-visual deepfakes in frontal-face videos by leveraging self-supervised AV-HuBERT-based features alongside a ViViT-based facial encoder. The AV-Lip-Sync+ architecture fuses multimodal synchronization cues and spatiotemporal facial artifacts through a transformer-augmented pipeline and MS-TCN for temporal reasoning, achieving state-of-the-art results on FakeAVCeleb and DeepfakeTIMIT. Key contributions include a Sync-Check Module that quantifies lip-sync inconsistencies, a robust feature fusion strategy, and the addition of a full-face encoder to handle non-lip-manipulated scenarios. Experimental results demonstrate superior accuracy and robustness, highlighting the practical impact for timely multimedia forensics in real-world settings, with future work focusing on generalization across datasets and manipulation types.

Abstract

Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. Timely detection is crucial to curb the spread of false propaganda and fake news. Tampering with either modality (i.e., visual or audio) can only be discovered by multimodal models that exploit both streams of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor that exploits the inconsistency between the audio and visual modalities for multimodal video forgery detection. We use the transformer-based, SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture the spatial and temporal artifacts introduced during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
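
As a rough illustration of the back end described in the abstract (multi-scale temporal convolution, then pooling and a classifier), the pure-Python sketch below runs several temporal convolution branches over a sequence of fused audio-visual feature vectors, concatenates them, averages over time, and applies a logistic layer. It is not the authors' implementation: the kernel sizes (3, 5, 7), the depthwise averaging kernels, the random features standing in for AV-HuBERT outputs, and all function names are assumptions made for this example.

```python
import math
import random


def temporal_conv(seq, kernel):
    """Zero-padded ('same') 1D convolution over time.

    seq is a list of T feature vectors (each of length D); the same kernel
    is applied to every feature channel independently (depthwise).
    """
    k = len(kernel)
    pad = k // 2
    T, D = len(seq), len(seq[0])
    out = []
    for t in range(T):
        row = []
        for d in range(D):
            acc = 0.0
            for j in range(k):
                idx = t + j - pad
                if 0 <= idx < T:  # positions outside the clip count as zero
                    acc += kernel[j] * seq[idx][d]
            row.append(acc)
        out.append(row)
    return out


def ms_tcn_head(seq, kernels, weights, bias):
    """Multi-scale temporal conv -> ReLU -> mean pooling -> logistic layer."""
    T = len(seq)
    # One branch per kernel; each branch sees a different temporal context.
    branches = [temporal_conv(seq, k) for k in kernels]
    # Concatenate branch outputs along the feature axis and apply ReLU.
    fused = [[max(0.0, v) for b in branches for v in b[t]] for t in range(T)]
    # Temporal pooling: average each feature over all time steps.
    pooled = [sum(col) / T for col in zip(*fused)]
    # Linear layer plus sigmoid gives a probability-like score.
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)
T, D = 25, 4  # hypothetical clip length (frames) and feature size
features = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
kernels = [[1.0 / k] * k for k in (3, 5, 7)]  # assumed scales, untrained
weights = [random.uniform(-1.0, 1.0) for _ in range(len(kernels) * D)]
prob = ms_tcn_head(features, kernels, weights, bias=0.0)
print(f"score for the 'fake' class (untrained, illustrative): {prob:.3f}")
```

In a trained model the kernels and the linear weights would be learned, and the input would be the per-frame features produced by the pre-trained encoders rather than random numbers.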

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Illustration of various deepfake manipulation techniques applied to a real audio-visual sample. The real sample (left), highlighted by the dark blue border, contains the original video frames and the corresponding audio waveform. The manipulated (fake) samples, highlighted by dark red borders, are generated using Wav2Lip, Faceswap, Faceswap-Wav2Lip, FSGAN, FSGAN-Wav2Lip, and RTVC (Real-Time Voice Cloning). Video frames with blue borders are real, while video frames with red borders are manipulated (fake). Blue waveforms represent real audio, while red waveforms represent manipulated (fake) audio.
  • Figure 2: The proposed AV-Lip-Sync+ architecture for multimodal forgery detection. The lip image sequence is extracted from the input video, while the log filterbank energies are extracted from the audio track. The SSL pre-trained model consists of a ResNet-18 for visual feature extraction, an FFN for acoustic feature extraction, and a transformer encoder that extracts spatiotemporal information from the visual and acoustic features. The extracted audio-visual features are then passed through a multi-scale temporal convolutional network (MS-TCN), temporal pooling, and a linear layer for classification.
  • Figure 3: ROC curves and AUC scores of the proposed AV-Lip-Sync+ method on various test sets of the FakeAVCeleb dataset.
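
For readers unfamiliar with the metric in Figure 3, the following minimal sketch shows how an ROC curve and its AUC can be computed from per-video detector scores. The labels and scores are made-up toy values, not results from the paper; treating "fake" as the positive class (label 1) and the function names are assumptions for this example.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score to the lowest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(labels, scores):
    """AUC as the chance that a random fake outscores a random real (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = fake, 0 = real; scores are hypothetical detector outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
curve = roc_points(labels, scores)
auc = roc_auc(labels, scores)
print(curve)
print(f"AUC = {auc:.3f}")
```

The pairwise formulation is quadratic in the number of samples, which is fine for illustration; an evaluation over a full test set would normally use a sorted, rank-based computation or a library routine.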