
SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao, Chia-Wen Lin, Yan-Tsung Peng, Hsin-Min Wang

Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies that remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. This reliance on synthetic training data can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely from authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects the temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.



Figures (3)

  • Figure 1: Prior visual and audio-visual deepfake detectors typically rely on manipulated training data and task-specific preprocessing, using either visual-only models or audio-visual synchronization cues with ad hoc fusion, which can lead to poor generalization under unseen generators. In contrast, the proposed model, SAVe, is trained exclusively on authentic videos by generating on-the-fly, identity-preserving self-blended pseudo-manipulations. SAVe combines FaceBlend, LipBlend, LowerFaceBlend, and AVSync features via a Fusion head for robust real/fake prediction (an illustrative fusion-head sketch follows this list).
  • Figure 2: Overview of the proposed SAVe framework, a self-supervised learning (SSL) audio-visual deepfake detection method trained solely on authentic data. Given an input video, visual frames and audio are extracted. In the AVSync branch, both modalities are encoded by AV-HuBERT to obtain audio ($F_a$) and visual ($F_v$) representations, which are then processed by an alignment network to produce an audio-visual misalignment score that reflects lip-speech synchronization consistency (a minimal sketch of this branch follows this list). In parallel, the SS-VPFG module synthesizes on-the-fly pseudo-forged visual samples via identity-preserving self-blending over multiple facial regions, providing region-aware supervision: face (real $\text{I}_{\text{FR}}$ vs. fake $\text{I}_{\text{FB}}$), lip (real $\text{I}_{\text{LR}}$ vs. fake $\text{I}_{\text{LB}}$), and lower-face (real $\text{I}_{\text{LFR}}$ vs. fake $\text{I}_{\text{LFB}}$). Finally, region-specific visual features and AVSync misalignment cues are aggregated by the Fusion module to output the final prediction (Real/Fake).
  • Figure 3: Overview of the proposed Self-Supervised Visual Pseudo-Forgery Generation (SS-VPFG) module. Given a Base Image, the module first performs source-target augmentation and facial landmark detection to produce aligned source and target images. Three region-specific blending pipelines are then applied: FaceBlend, LipBlend, and LowerFaceBlend. Each pipeline extracts the corresponding region of interest (ROI), generates region masks, applies targeted augmentations, and outputs forged variants, namely $\text{I}_{\text{FB}}$, $\text{I}_{\text{LB}}$, and $\text{I}_{\text{LFB}}$, alongside their intermediate augmented results ($\text{I}_{\text{FR}}$, $\text{I}_{\text{LR}}$, and $\text{I}_{\text{LFR}}$). These pseudo forgeries provide diverse self-supervised signals for training visual forgery detectors (a minimal blending sketch follows this list).
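To make the AVSync branch of Figure 2 concrete, here is a minimal sketch of an alignment network that scores lip-speech misalignment from paired audio ($F_a$) and visual ($F_v$) feature sequences. The module names, feature dimensions, and the distance-based scoring below are illustrative assumptions, not the paper's implementation; the encoders producing the features (AV-HuBERT in the paper) are stubbed with random tensors.

```python
# Hedged sketch of an AVSync-style alignment network (assumptions throughout):
# paired audio/visual features are projected, their per-frame discrepancy is
# pooled over time, and a small MLP emits a scalar misalignment score.
import torch
import torch.nn as nn

class AlignmentNet(nn.Module):
    """Maps paired audio/visual feature sequences to a misalignment score."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj_a = nn.Linear(dim, 256)   # project audio features F_a
        self.proj_v = nn.Linear(dim, 256)   # project visual features F_v
        self.scorer = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, f_a: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
        # f_a, f_v: (batch, time, dim) frame-level features from the encoders.
        a = self.proj_a(f_a)
        v = self.proj_v(f_v)
        # Per-frame discrepancy captures temporal lip-speech inconsistency.
        diff = (a - v).abs()
        # Pool over time and score; a higher output means stronger misalignment.
        return self.scorer(diff.mean(dim=1)).squeeze(-1)

# Usage with dummy tensors standing in for AV-HuBERT outputs:
f_a = torch.randn(2, 50, 768)       # audio features F_a
f_v = torch.randn(2, 50, 768)       # visual features F_v
score = AlignmentNet()(f_a, f_v)    # shape (2,): one score per video
```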
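Likewise, the core idea of one SS-VPFG pipeline from Figure 3 (LipBlend, say) can be sketched as follows: a frame is duplicated, the copy is photometrically augmented (identity is preserved because source and target show the same person), and the augmented region is blended back through a soft landmark-derived mask. The specific augmentations, mask feathering, and the 68-point landmark convention are assumptions; swapping the landmark subset yields FaceBlend or LowerFaceBlend analogues.

```python
# Hedged sketch of identity-preserving self-blending (not the paper's exact
# pipeline): build a soft mask from region landmarks, augment a copy of the
# frame, and composite the copy back through the mask.
import cv2
import numpy as np

def region_mask(landmarks: np.ndarray, shape: tuple, blur: int = 15) -> np.ndarray:
    """Soft mask over the convex hull of the given landmark subset."""
    mask = np.zeros(shape[:2], dtype=np.float32)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1.0)
    # Feather the boundary so blending seams stay subtle.
    mask = cv2.GaussianBlur(mask, (blur, blur), 0)
    return mask[..., None]

def self_blend(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Return a pseudo-forged frame (an I_LB analogue) from a real frame."""
    source = frame.astype(np.float32)
    # Simple photometric augmentation of the source copy (illustrative only).
    source = np.clip(source * np.random.uniform(0.8, 1.2) +
                     np.random.uniform(-10, 10), 0, 255)
    mask = region_mask(landmarks, frame.shape)
    blended = mask * source + (1.0 - mask) * frame.astype(np.float32)
    return blended.astype(np.uint8)

# Usage: lip landmarks (indices 48-67 in the common 68-point convention) would
# come from any off-the-shelf detector; random points serve as placeholders here.
frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
lip_pts = np.random.randint(90, 170, (20, 2))
pseudo_fake = self_blend(frame, lip_pts)
```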
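Finally, the Fusion head referenced in Figure 1 could take a form like the sketch below, which concatenates the three region-specific visual features with the AVSync misalignment score and classifies the result. The concat-plus-MLP design and all dimensions are assumptions; the paper's actual fusion may differ.

```python
# Hedged sketch of a fusion head over region features and the AVSync cue.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, vis_dim: int = 512):
        super().__init__()
        # Three visual branches plus one scalar misalignment score.
        self.classifier = nn.Sequential(
            nn.Linear(3 * vis_dim + 1, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, f_face, f_lip, f_lower, sync_score):
        # f_*: (batch, vis_dim) region features; sync_score: (batch,) scalar.
        x = torch.cat([f_face, f_lip, f_lower, sync_score.unsqueeze(-1)], dim=-1)
        return self.classifier(x)  # (batch, 2) real/fake logits

# Usage with dummy features:
f = torch.randn(4, 512)
logits = FusionHead()(f, f, f, torch.rand(4))  # shape (4, 2)
```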