Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

Soumyya Kanti Datta, Shan Jia, Siwei Lyu

TL;DR

This work tackles the challenging problem of detecting lip-syncing deepfakes by focusing on mouth-region spatiotemporal inconsistencies. It introduces LIPINC-V2, a detector that combines Local-Global Mouth Frame extraction with a Mouth Spatial-Temporal Inconsistency Extractor powered by a Vision Temporal Transformer and multi-head cross-attention, trained with a dedicated inconsistency loss. A new LipSyncTIMIT dataset is introduced to evaluate generalization to unseen lip-syncing models, and extensive experiments demonstrate state-of-the-art performance in both in-domain and cross-domain settings, as well as segment-wise localization capabilities. The approach shows strong robustness to common distortions and highlights the potential of cross-modal, mouth-focused analysis for practical deepfake detection and localization tasks.

Abstract

Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized by AI models to match altered or entirely new audio. Unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multi-head cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Illustration of the mouth inconsistency in lip-syncing deepfakes. We visualize video frames from two lip-synced videos. Here T represents frame number. The first five columns present consecutive frames which are 0.03 secs apart for local comparison, while the last three columns offer a broader perspective by displaying frames with similar poses from the entire video, defined as global inconsistencies in our paper. The deepfakes exhibit more pronounced inconsistencies in aspects such as mouth shape, coloration, dental structure, and tongue appearance.
  • Figure 2: End-to-end pipeline of the proposed LIPINC-V2 model. Our approach comprises two main modules: (1) the Local and Global Mouth Frame Extractor, responsible for isolating adjacent and similarly posed mouth segments based on mouth openness over time; and (2) the Mouth Spatial-Temporal Inconsistency Extractor, tasked with learning distinctive inconsistency features both within and across frames by leveraging mouth appearance and delta frames.
  • Figure 3: Pipeline of the Local & Global Mouth Frame Extractor. Here T represents frame number.
  • Figure 4: Mouth region landmarks detected by Dlib. Orange points denote the landmarks used for mouth openness measurement and matching.
  • Figure 5: The architecture of the Vision Temporal Transformer and the Multihead Cross-Attention Block within our Mouth Spatial-Temporal Inconsistency Extractor. This module encodes RGB mouth and delta frames to learn spatial and temporal inconsistencies using a Vision Temporal Transformer. Through stacked spatial and temporal encoders, it captures relationships within frames and dependencies across frames, focusing on discrepancies in the mouth region. To fully leverage both feature streams, a multi-head cross-attention block interconnects RGB mouth and delta frame branches, ensuring a robust deepfake detection architecture.
  • ...and 5 more figures
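
The captions for Figures 2 to 4 mention three ingredients: a mouth openness measure taken from Dlib landmarks, frames with a similar mouth pose drawn from across the video, and delta frames. The sketch below illustrates these ideas under stated assumptions and is not the authors' implementation. It assumes Dlib's standard 68-point layout and measures openness as the distance between inner-lip points 62 and 66 (the captions do not say which landmarks are used), it treats delta frames as differences between consecutive frames, and the function names `mouth_openness`, `delta_frames`, and `most_similar_frames` are hypothetical.

```python
import numpy as np


def mouth_openness(landmarks):
    """Distance between the inner upper-lip and inner lower-lip centers.

    `landmarks` is a (68, 2) array in Dlib's 68-point ordering (0-indexed).
    Using points 62 and 66 is an assumption made for this sketch.
    """
    lm = np.asarray(landmarks, dtype=float)
    return float(np.linalg.norm(lm[62] - lm[66]))


def delta_frames(frames):
    """Differences between consecutive frames.

    `frames` has shape (T, H, W, C); the result has shape (T - 1, H, W, C).
    Defining a delta frame as a consecutive-frame difference is an assumption.
    """
    f = np.asarray(frames, dtype=np.float32)
    return f[1:] - f[:-1]


def most_similar_frames(openness, ref_idx, k):
    """Indices of the k frames whose openness is closest to the reference frame.

    The reference frame itself is excluded; indices are returned in
    temporal order.
    """
    o = np.asarray(openness, dtype=float)
    diff = np.abs(o - o[ref_idx])
    diff[ref_idx] = np.inf  # never pick the reference frame itself
    nearest = np.argsort(diff, kind="stable")[:k]
    return sorted(int(i) for i in nearest)
```

Given one openness value per frame, `most_similar_frames(openness, ref_idx=0, k=3)` would return three frame indices that could serve as the "global" comparison set for frame 0, while `delta_frames` applied to a short run of adjacent mouth crops gives the "local" motion signal.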