FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection

Gil Knafo, Ohad Fried

TL;DR

This work tackles deepfake detection that generalizes to unseen manipulation techniques and datasets. It introduces FakeOut, a two-stage framework: a multi-modal backbone (MMV) is first pre-trained in a self-supervised manner on large-scale out-of-domain video data, and is then fine-tuned on in-domain deepfake data, using both visual and auditory cues. Supported by a face-tracking and data enrichment pipeline, the method achieves state-of-the-art cross-dataset generalization on audio-visual deepfake benchmarks and demonstrates the value of out-of-domain multi-modal pre-training for detection tasks. Ablations and analyses highlight the benefits of fine-tuning over linear probing, of audio-visual enrichment, and of the data-preprocessing pipeline. Code is available on GitHub.

Abstract

Video synthesis methods have improved rapidly in recent years, allowing easy creation of synthetic humans. This poses a problem, especially in the era of social media, as synthetic videos of speaking humans can be used to spread misinformation in a convincing manner. Thus, there is a pressing need for accurate and robust deepfake detection methods that can detect forgery techniques not seen during training. In this work, we explore whether this can be done by leveraging a multi-modal, out-of-domain backbone trained in a self-supervised manner and adapted to the video deepfake domain. We propose FakeOut, a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase. We demonstrate the efficacy and robustness of FakeOut in detecting various types of deepfakes, especially manipulations that were not seen during training. Our method achieves state-of-the-art results in cross-dataset generalization on audio-visual datasets. This study shows that, perhaps surprisingly, training on out-of-domain videos (i.e., not especially featuring speaking humans) can lead to better deepfake detection systems. Code is available on GitHub.

Paper Structure

This paper contains 34 sections, 6 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: FakeOut schematic overview. Adaptation of an out-of-domain self-supervised backbone to the video deepfake domain. FakeOut achieves state-of-the-art results without relying on extra facial datasets besides the in-domain deepfake datasets used for the fine-tuning phase.
  • Figure 2: Architecture of FakeOut, a multi-modal learning system for video deepfake detection. Left: pre-training phase on multi-modal videos outside the deepfake domain, via MMV (Alayrac et al., 2020). Right: adaptation phase, implemented using fine-tuning, in which we adapt the backbone to the video deepfake detection task. We utilize cross-modality representations of audio and video obtained by the face detection pipeline and the enrichment process.
  • Figure 3: Our enrichment process. Each video in the FaceForensics++, DeeperForensics and FaceShifter datasets is enriched with the relevant audio file if it is available, according to this scheme.
  • Figure 4: Audio-visual feature similarity. Cosine similarity of the auditory and visual feature vectors extracted from FakeAVCeleb videos. The similarity values for the RealVideo-RealAudio type are distributed around much higher values than in the RealVideo-FakeAudio case.
  • Figure 5: Adaptation ablation: fine-tuning vs. linear probing. We evaluate FakeOut on the cross-dataset generalization task, using two approaches for adapting to the video deepfake detection domain. FakeOut is trained on the FF++ training set. We compare fine-tuning the whole network vs. linear probing.
  • ...and 10 more figures
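
The Figure 4 caption compares audio and visual feature vectors by cosine similarity. As a minimal sketch of that measure only (not the authors' code), the snippet below computes it for plain Python lists. The function name `cosine_similarity` and the toy embeddings are hypothetical; real feature vectors would come from the backbone's audio and video branches.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings for illustration; not values from the paper.
video_embedding = [1.0, 2.0, 0.5]
matching_audio = [2.0, 4.0, 1.0]     # same direction as the video embedding
mismatched_audio = [-1.0, 0.5, 0.0]  # orthogonal to the video embedding

sim_match = cosine_similarity(video_embedding, matching_audio)
sim_mismatch = cosine_similarity(video_embedding, mismatched_audio)
```

In this toy setup the aligned pair scores close to 1 and the orthogonal pair scores 0, which mirrors the qualitative gap the caption describes between the RealVideo-RealAudio and RealVideo-FakeAudio cases.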