Table of Contents
Fetching ...

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, Ziyou Liang, Run Wang

TL;DR

This paper proposes a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals and mimics human natural cognition by capturing subtle biological links between lips and head regions to boost accuracy.

Abstract

In recent years, DeepFake technology has achieved unprecedented success in high-quality video synthesis, but these methods also pose potential and severe security threats to humanity. DeepFake can be bifurcated into entertainment applications like face swapping and illicit uses such as lip-syncing fraud. However, lip-forgery videos, which neither change identity nor have discernible visual artifacts, present a formidable challenge to existing DeepFake detection methods. Our preliminary experiments have shown that the effectiveness of the existing methods often drastically decrease or even fail when tackling lip-syncing videos. In this paper, for the first time, we propose a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals. We also mimic human natural cognition by capturing subtle biological links between lips and head regions to boost accuracy. To better illustrate the effectiveness and advances of our proposed method, we create a high-quality LipSync dataset, AVLips, by employing the state-of-the-art lip generators. We hope this high-quality and diverse dataset could be well served the further research on this challenging and interesting field. Experimental results show that our approach gives an average accuracy of more than 95.3% in spotting lip-syncing videos, significantly outperforming the baselines. Extensive experiments demonstrate the capability to tackle deepfakes and the robustness in surviving diverse input transformations. Our method achieves an accuracy of up to 90.2% in real-world scenarios (e.g., WeChat video call) and shows its powerful capabilities in real scenario deployment. To facilitate the progress of this research community, we release all resources at https://github.com/AaronComo/LipFD.

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

TL;DR

This paper proposes a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals and mimics human natural cognition by capturing subtle biological links between lips and head regions to boost accuracy.

Abstract

In recent years, DeepFake technology has achieved unprecedented success in high-quality video synthesis, but these methods also pose potential and severe security threats to humanity. DeepFake can be bifurcated into entertainment applications like face swapping and illicit uses such as lip-syncing fraud. However, lip-forgery videos, which neither change identity nor have discernible visual artifacts, present a formidable challenge to existing DeepFake detection methods. Our preliminary experiments have shown that the effectiveness of the existing methods often drastically decrease or even fail when tackling lip-syncing videos. In this paper, for the first time, we propose a novel approach dedicated to lip-forgery identification that exploits the inconsistency between lip movements and audio signals. We also mimic human natural cognition by capturing subtle biological links between lips and head regions to boost accuracy. To better illustrate the effectiveness and advances of our proposed method, we create a high-quality LipSync dataset, AVLips, by employing the state-of-the-art lip generators. We hope this high-quality and diverse dataset could be well served the further research on this challenging and interesting field. Experimental results show that our approach gives an average accuracy of more than 95.3% in spotting lip-syncing videos, significantly outperforming the baselines. Extensive experiments demonstrate the capability to tackle deepfakes and the robustness in surviving diverse input transformations. Our method achieves an accuracy of up to 90.2% in real-world scenarios (e.g., WeChat video call) and shows its powerful capabilities in real scenario deployment. To facilitate the progress of this research community, we release all resources at https://github.com/AaronComo/LipFD.
Paper Structure (26 sections, 8 equations, 10 figures, 6 tables)

This paper contains 26 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: A visualization comparison between common deepfakes and our studied lip-syncing deepfakes (LipSync). The former exhibits a substantial forgery area and identity manipulation, such as face or gender swapping, whereas the latter, relies on the synchronization of the minor lip region and given audio, without any alterations to the subject's identity. As illustrated in the comparison above, discerning the authenticity of an image sequence becomes arduous in the absence of labels.
  • Figure 2: (a) shows the correlation between lip movements and corresponding spectrogram in genuine pattern. When the woman starts talking, the middle and high frequencies in the spectrum are lighted. Over time, the energy gradually fades and shifts from middle to lower frequencies. (b) the first two frames show a highlighted high-frequency spectrum, contradicting the man not speaking. In the third frame, an unexpected lip opening appears at the darkest part of the spectrum. The mouth cannot change so drastically within a single frame, and this lip shape contradicts the spectrum information.
  • Figure 3: AVLips dataset construction. Utilizing static and dynamic methods, we generated high-quality videos with realistic lip movements. The diverse dataset includes various real-world scenarios. Perturbations were applied for robust model training.
  • Figure 4: Overview of LipFD framework. Blue components represent our main modules in LipFD. The input image was generated by pre-processing, which consists of $T$ frames in the target video and their audio spectrogram. (a) The aim of Global Feature Encoder, a self-attention model, is to extract long-term information between video frames and audio, finding unreasonable correspondences between lip movements and audio. (b) $E_{GR}$ encodes three series of crops, focusing on different parts for each region, and concatenates them with global feature $F_G$. (c) The Region Awareness module assigns corresponding weights to the features based on their importance. (d) All features are fused together into a unified representation $F$ based on their respective weights for final inference.
  • Figure 5: Robustness against various unseen corruptions. Average AUC scores across five intensity levels for various corruptions. For detailed analysis, please refer to the appendix.
  • ...and 5 more figures