Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, Wael AbdAlmageed

TL;DR

This paper addresses the challenge of detecting deepfakes without relying on labeled data or on pristine real samples at inference. It presents an information-theoretic argument that facial motion and identity are interdependent, so fake videos inevitably carry traces, and develops unsupervised detectors based on intra-modal and cross-modal inconsistencies that are trained solely on real videos. The method yields two complementary consistency losses, combines them into an Intra-Cross-modal score, and achieves state-of-the-art unsupervised performance on FakeAVCeleb while closely matching supervised baselines, with strong generalization to KoDF and robustness to compression and adversarial attacks. The approach is scalable, reliable, and explainable: it localizes inconsistent regions that human experts can inspect, which makes it suitable for real-world forensic deployment.

Abstract

Deepfake videos present an increasing threat to society with potentially negative impact on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes, at scale, remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we propose a novel unsupervised method for detecting deepfake videos by directly identifying intra-modal and cross-modal inconsistency between video segments. The fundamental hypothesis behind the proposed detection method is that motion or identity inconsistencies are inevitable in deepfake videos. We will mathematically and empirically support this hypothesis, and then proceed to constructing our method grounded in our theoretical analysis. Our proposed method outperforms prior state-of-the-art unsupervised deepfake detection methods on the challenging FakeAVCeleb dataset, and also has several additional advantages: it is scalable because it does not require pristine (real) samples for each identity during inference and therefore can apply to arbitrarily many identities, generalizable because it is trained only on real videos and therefore does not rely on a particular deepfake method, reliable because it does not rely on any likelihood estimation in high dimensions, and explainable because it can pinpoint the exact location of modality inconsistencies which are then verifiable by a human expert.

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: When transferring the motion of the source video (top) to the target identity (Angelina Jolie), the deepfake generation method MegaPortraits (Drobyshev et al., 2022) faces a trade-off: (middle) matching motion exactly results in some frames having the wrong identity, which can be detected by looking for intra-modal identity inconsistency; or (bottom) matching identity exactly results in motion distortion, which can be detected by looking for cross-modal inconsistency between video and audio (e.g., the lips do not move at moments where the audio magnitude indicates speech, and vice versa). Red boxes show inconsistencies.
  • Figure 2: Training and testing scheme for the intra-modal and cross-modal consistency methods. For each training batch, we take multiple fixed-size video and audio clips of $N$ distinct identities and feed them into our networks. ① is the output (feature vectors) of all identities extracted from the identity network at time window $t_{a}$. The similarity matrices computed along the time dimension are shown in the gray box on the left; each element ② is the similarity matrix of ① for a specific time-window pair $(t_{a},t_{b})$. An intra-modal consistency loss for training the identity network is calculated from this tensor. ③ denotes the feature vectors generated by the video and audio networks at time window $t_{a}$. The features of each individual across multiple time windows are used to generate that individual's similarity matrix. A cross-modal consistency loss for training the video and audio networks is calculated from these $N$ two-dimensional matrices.
  • Figure 3: Explainability of the proposed methods, using the intra-modal and cross-modal consistency losses, for two samples in FakeAVCeleb. When the method decides that a given video is fake because its average score falls below a threshold (light-gray boxes), it can present the portions of the video with the minimum consistency score to a human expert as an explanation (yellow boxes), and the expert can verify the method's decision through manual comparison.
  • Figure 4: The similarity matrices ($M_{intra}$) of the intra-modal method clearly show stronger temporal fluctuations of identity in deepfake videos (middle) than in real videos (top). The AUC of the intra-modal method on FakeAVCeleb, with videos sorted by magnitude of motion, increases as the motion in the videos increases (bottom).
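
The decision-and-explanation step described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a trained network has already produced one embedding per time window of a single video, that similarity is measured with cosine similarity, and that the video score is the mean off-diagonal similarity; the function names, the `threshold` value, and the returned fields are all hypothetical.

```python
import numpy as np


def similarity_matrix(features):
    """Cosine similarity between every pair of time windows.

    features: array of shape (T, D), one embedding per time window
    (assumed to come from an already trained network).
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def consistency_report(features, threshold):
    """Score a video and locate its least consistent pair of time windows.

    Assumes at least two time windows. The averaging rule and the
    threshold are illustrative choices, not values from the paper.
    """
    sim = similarity_matrix(features)
    num_windows = sim.shape[0]
    off_diagonal = ~np.eye(num_windows, dtype=bool)

    # Average consistency over all distinct time-window pairs.
    score = float(sim[off_diagonal].mean())

    # Ignore the diagonal (self-similarity) when searching for the minimum.
    masked = np.where(off_diagonal, sim, np.inf)
    t_a, t_b = np.unravel_index(np.argmin(masked), masked.shape)

    return {
        "score": score,
        "is_fake": bool(score < threshold),
        "least_consistent_pair": (int(t_a), int(t_b)),
        "similarity": sim,
    }
```

Under these assumptions, a video whose embeddings stay nearly constant over time receives a score close to 1, while a video with one deviating window receives a lower score, and `least_consistent_pair` points at the windows a human expert would inspect first.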