Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng

TL;DR

The paper tackles the challenge of generalizing multimodal deepfake detection while enabling precise temporal localization. It introduces a single-stage framework that combines unimodal and cross-modal embeddings with three masked-prediction modules, a causal transformer backbone, and local convolutional attention to detect intra- and inter-modal inconsistencies. By leveraging next-frame feature prediction and frame-level contrastive guidance, the method achieves strong cross-manipulation and cross-dataset generalization and sets new benchmarks for temporal localization on LAV-DF. The approach shares one backbone between detection and localization, offering a practical, interpretable, and scalable solution with realistic inference costs. Overall, it advances robust multimodal deepfake detection and fine-grained localization without requiring two-stage pretraining.

Abstract

Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such generalization-oriented approaches require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both unimodal and cross-modal features. We also introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
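
The core idea of comparing predicted and actual next-frame features can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the predictor `repeat_last` is a hypothetical placeholder for the learned causal model, and the uniform window average in `window_scores` is a simplified stand-in for the learned window-level attention. All function and variable names are assumptions made for illustration.

```python
from typing import Callable, List

Frame = List[float]


def next_frame_deviation(
    features: List[Frame],
    predict_next: Callable[[List[Frame]], Frame],
) -> List[float]:
    """Mean absolute deviation between predicted and actual features.

    Returns one value per frame t = 1 .. T-1. The predictor only sees
    frames before t, which mimics a causal (past-only) model.
    """
    deviations = []
    for t in range(1, len(features)):
        predicted = predict_next(features[:t])
        actual = features[t]
        total = sum(abs(p - a) for p, a in zip(predicted, actual))
        deviations.append(total / len(actual))
    return deviations


def window_scores(deviations: List[float], radius: int = 1) -> List[float]:
    """Average each deviation with its neighbours inside a local window.

    Assumption: a uniform average replaces the learned attention weights.
    """
    scores = []
    for i in range(len(deviations)):
        lo = max(0, i - radius)
        hi = min(len(deviations), i + radius + 1)
        window = deviations[lo:hi]
        scores.append(sum(window) / len(window))
    return scores


def repeat_last(past: List[Frame]) -> Frame:
    """Hypothetical placeholder predictor: the next frame equals the last one."""
    return past[-1]


# Toy sequence with an abrupt change between frame 1 and frame 2.
features = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
dev = next_frame_deviation(features, repeat_last)
smoothed = window_scores(dev, radius=1)
print(dev)       # the deviation peaks at the transition into frame 2
print(smoothed)  # neighbouring frames share part of that peak
```

In this toy setting the deviation is zero wherever the sequence is predictable and spikes at the abrupt transition, which is the kind of local signal that a frame-level classifier or a temporal localization head could build on.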

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Proposed Pipeline: We extract unimodal embeddings and fuse them to create cross-modal features. Three masked-prediction modules detect intra-modal and cross-modal inconsistencies by predicting next-frame features and capturing deviations between predicted and actual features. We then fuse the intra-modal and cross-modal features through alternating cross-attention layers, and finally use the combined output for deepfake detection or temporal localization.
  • Figure 2: Cross-Modal Feature Fusion: Cross-modal features are formed by concatenating visual and audio encodings, processed through linear layers to learn a refined fusion representation.
  • Figure 3: Masked-Prediction based Feature Extraction: We extract intra-modal and cross-modal inconsistencies by measuring deviations between predicted and actual frame-level features. Local convolution-based attention is applied to detect inconsistencies. To enhance adaptability for both classification and localization, frame-level contrastive loss is applied, ensuring the model learns to distinguish real and manipulated frames effectively.
  • Figure 4: Regression Head: We extract the outputs from the intra-modal and cross-modal Masked-Prediction Feature Extraction modules, denoted by A, V, and C. We then concatenate these three features along the feature dimension and pass them through the adapted UMMAFormer model [zhang2023ummaformer].
  • Figure 5: We process one video sample from each category: (a) Real Visual–Real Audio, (b) Real Visual–Fake Audio, (c) Fake Visual–Fake Audio, and (d) Partial Deepfake through the trained models. We extract intra- and cross-modal frame features, compute absolute pointwise differences, and visualize them as heatmaps for analysis.
  • ...and 1 more figures
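
Two of the operations named in the captions above, the concatenate-then-project fusion of Figure 2 and the absolute pointwise differences of Figure 5, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's code: a single randomly initialised linear layer stands in for the learned linear layers, the dimensions `d_v`, `d_a`, and `d_c` are arbitrary, and the difference map is assumed to compare predicted against actual frame features.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply y = W x + b, with one row of `weight` per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_frame(visual: Vector, audio: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Concatenate per-frame visual and audio encodings, then project them."""
    return linear(visual + audio, weight, bias)


def abs_difference_map(predicted: Matrix, actual: Matrix) -> Matrix:
    """Pointwise |predicted - actual| for every frame and feature dimension."""
    return [
        [abs(p - a) for p, a in zip(pred_row, act_row)]
        for pred_row, act_row in zip(predicted, actual)
    ]


# Hypothetical sizes: 3-dim visual, 2-dim audio, 4-dim fused features.
rng = random.Random(0)
d_v, d_a, d_c = 3, 2, 4
weight = [[rng.uniform(-1.0, 1.0) for _ in range(d_v + d_a)] for _ in range(d_c)]
bias = [0.0] * d_c

visual = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # two frames
audio = [[1.0, 0.0], [0.0, 1.0]]             # two frames
cross = [fuse_frame(v, a, weight, bias) for v, a in zip(visual, audio)]

# One frame with two feature dimensions; rows are frames, columns are features.
heatmap = abs_difference_map([[0.5, 0.5]], [[0.25, 1.0]])
print(len(cross), len(cross[0]))
print(heatmap)
```

Rendering `heatmap` with frames on one axis and feature dimensions on the other gives the kind of view described for Figure 5, where larger values mark frames whose features deviate most.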