Exposing Lip-syncing Deepfakes from Mouth Inconsistencies

Soumyya Kanti Datta, Shan Jia, Siwei Lyu

TL;DR

This paper addresses the challenge of detecting lip-syncing deepfakes, which manipulate only the mouth region and evade many existing detectors. It introduces LIPINC, a two-branch architecture that jointly captures local and global mouth poses and a Mouth Spatial-Temporal Inconsistency Extractor guided by an inconsistency loss to distinguish real from fake videos. Across in-domain and cross-domain benchmarks, LIPINC achieves state-of-the-art or competitive performance on lip-sync datasets, with ablation analyses confirming the importance of each component and the potential limitations when facing non-lip-sync face-swaps. The work advances practical deepfake detection by focusing on mouth-region artifacts and provides code for reproducibility, with future work aimed at integrating audio cues for improved accuracy.

Abstract

A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are convincingly generated by AI models to match altered or entirely new audio. Lip-syncing deepfakes are a dangerous type of deepfake because the artifacts are confined to the lip region and are harder to discern. In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency (LIPINC), which detects lip-syncing deepfakes by identifying temporal inconsistencies in the mouth region. These inconsistencies appear both between adjacent frames and across the video as a whole. Our model successfully captures these irregularities and outperforms state-of-the-art methods on several benchmark deepfake datasets. Code is available at https://github.com/skrantidatta/LIPINC

Paper Structure

This paper contains 16 sections, 5 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Illustration of the mouth inconsistency in lip-syncing deepfakes. We visualize open-mouth video frames from the FakeAVCeleb dataset. Here, f denotes the frame number. The first three columns show consecutive frames for local comparison, while the last two columns give a broader view by showing frames with similar poses drawn from the entire video, which our paper defines as global inconsistencies. The deepfakes exhibit more pronounced inconsistencies in mouth shape, coloration, dental structure, and tongue appearance.
  • Figure 2: End-to-end pipeline of the proposed LIPINC model.
  • Figure 3: Mouth region landmarks detected by Dlib. Orange marks the landmarks used for mouth openness measurement and matching.
  • Figure 4: Ablation analysis on the LSR+W2L and KODF-LS datasets based on AUC scores. Here L, G, C, and S refer to Local, Global, Color, and Structure frames, respectively. IL is the Inconsistency Loss.
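
The captions for Figures 1 and 3 describe measuring mouth openness from landmarks, then comparing consecutive open-mouth frames (local) with similarly posed frames from elsewhere in the video (global). The following is a minimal sketch of that idea, not the authors' implementation. It assumes the common 68-point landmark layout, in which indices 60 to 67 trace the inner lip. The landmark indices, the `min_open` threshold, the function names, and the frame counts (three local, two global, mirroring the layout of Figure 1) are all illustrative assumptions.

```python
import math

# Assumption: 68-point landmark layout, inner lip contour at indices 60-67.
# The exact points used in the paper are not given here; these are illustrative.
UPPER_INNER, LOWER_INNER = 62, 66    # mid-points of the inner upper/lower lip
LEFT_CORNER, RIGHT_CORNER = 60, 64   # inner mouth corners


def mouth_openness(landmarks):
    """Vertical inner-lip gap divided by inner mouth width (scale-invariant)."""
    gap = math.dist(landmarks[UPPER_INNER], landmarks[LOWER_INNER])
    width = math.dist(landmarks[LEFT_CORNER], landmarks[RIGHT_CORNER])
    return gap / width if width > 0 else 0.0


def select_local_and_global(openness, n_local=3, n_global=2, min_open=0.1):
    """Return (local, global) frame indices, or None if no open-mouth run exists.

    local:  the first run of n_local consecutive frames with an open mouth.
    global: the n_global open-mouth frames outside that run whose openness
            is closest to that of the first local frame.
    """
    for start in range(len(openness) - n_local + 1):
        local = list(range(start, start + n_local))
        if all(openness[i] >= min_open for i in local):
            break
    else:
        return None  # no run of n_local consecutive open-mouth frames

    ref = openness[local[0]]
    candidates = [
        i for i in range(len(openness))
        if i not in local and openness[i] >= min_open
    ]
    # Rank the remaining open-mouth frames by pose similarity to the reference.
    candidates.sort(key=lambda i: abs(openness[i] - ref))
    return local, sorted(candidates[:n_global])


if __name__ == "__main__":
    # Hypothetical per-frame openness scores for a short clip.
    scores = [0.0, 0.05, 0.3, 0.32, 0.31, 0.0, 0.6, 0.29, 0.0, 0.33]
    print(select_local_and_global(scores))
```

Dividing the lip gap by the mouth width keeps the score comparable across face sizes. A real pipeline would obtain `landmarks` from a detector such as Dlib and pass the selected frames to the detection model.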