Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

Soumyya Kanti Datta; Shan Jia; Siwei Lyu

TL;DR

This work addresses the detection of lip-syncing deepfakes by focusing on spatiotemporal inconsistencies in the mouth region. It introduces LIPINC-V2, a detector that combines Local-Global Mouth Frame extraction with a Mouth Spatial-Temporal Inconsistency Extractor powered by a Vision Temporal Transformer and multi-head cross-attention, trained with a dedicated inconsistency loss. A new dataset, LipSyncTIMIT, is introduced to evaluate generalization to unseen lip-syncing models. Extensive experiments demonstrate state-of-the-art performance in both in-domain and cross-domain settings, along with segment-wise localization capability. The approach is robust to common distortions and highlights the potential of cross-modal, mouth-focused analysis for practical deepfake detection and localization.

Abstract

Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized by AI models to match altered or entirely new audio. Unlike in other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multi-head cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, which improves its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .
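
To make the multi-head cross-attention component mentioned in the abstract more concrete, the following is a minimal NumPy sketch of the generic operation, not the LIPINC-V2 implementation. The names `local_feats` and `global_feats`, the feature sizes, the choice of which sequence acts as the query, and the random projection matrices (standing in for learned weights) are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(query_seq, context_seq, num_heads, rng):
    """Generic cross-attention: queries from one sequence, keys/values from another.

    query_seq:   (T_q, d) features, e.g. per-frame mouth embeddings (hypothetical)
    context_seq: (T_c, d) features from a second stream (hypothetical)
    Returns the attended output (T_q, d) and the weights (num_heads, T_q, T_c).
    """
    t_q, d = query_seq.shape
    t_c, _ = context_seq.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_head = d // num_heads

    # Random projections stand in for learned weight matrices (assumption).
    w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    # Project, then split the feature axis into heads: (num_heads, T, d_head).
    q = (query_seq @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context_seq @ w_k).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)
    v = (context_seq @ w_v).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, T_q, T_c)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, T_q, d_head)

    # Concatenate heads back to (T_q, d) and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
local_feats = rng.standard_normal((5, 16))   # 5 hypothetical query frames
global_feats = rng.standard_normal((8, 16))  # 8 hypothetical context frames
out, attn = multi_head_cross_attention(local_feats, global_feats, num_heads=4, rng=rng)
print(out.shape, attn.shape)  # (5, 16) (4, 5, 8)
```

Each row of `attn` sums to 1, so every query frame distributes its attention over all context frames. Mechanisms of this kind are one way a model can relate mouth appearance at one time step to appearance at other time steps.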
