Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

Tharun Anand, Siva Sankar Sajeev, Pravin Nair

TL;DR

This paper tackles the detection of localized, fine-grained deepfake edits that evade traditional detectors. It introduces an action-unit-guided spatio-temporal video representation learned through two self-supervised pretext tasks (masked frame reconstruction and action-unit map reconstruction), whose outputs are fused via cross-attention to form a robust latent embedding for real/fake classification. Pretrained on CelebV-HQ and trained on the FF++ dataset, the approach achieves a 20% improvement in detection accuracy over state-of-the-art methods on localized edits, generalizes well to standard deepfake datasets, and is resilient to common perturbations. The results underscore the value of combining global frame dynamics with localized facial cues (action units, AUs) for deepfake detection and suggest applicability to broader video analysis tasks.

Abstract

With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits: subtle manipulations of specific facial features, such as raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method uses a cross-attention-based fusion of representations learned from pretext tasks, namely random masking and action unit detection, to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF++ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a $20\%$ improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.
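
The cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it is a single-head cross-attention layer with random weights, and the choice of which stream supplies the queries, the token count, the feature width, and all names (`frame_feats`, `au_feats`, `cross_attention`) are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends over context tokens."""
    q = queries @ w_q                          # (N, d)
    k = context @ w_k                          # (M, d)
    v = context @ w_v                          # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16                          # hypothetical sizes

# Random stand-ins for the two pretext-task encoder outputs (assumption).
frame_feats = rng.normal(size=(n_tokens, dim)) # masked-reconstruction stream
au_feats = rng.normal(size=(n_tokens, dim))    # action-unit stream

# Random projection matrices; a trained model would learn these.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Assumption: frame features act as queries, AU features as keys and values.
fused, weights = cross_attention(frame_feats, au_feats, w_q, w_k, w_v)
print(fused.shape)
```

In this sketch each frame token becomes a weighted mixture of action-unit features, which is one plausible way for localized facial cues to shape a global video embedding; the paper's actual block layout may differ.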

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Locally Edited Deepfakes Detection: A real video is manipulated to produce fake videos with subtle, hard-to-detect edits: raised eyebrows, gender modification, and expression change to disgust (single frame shown for illustration). Our method achieves significantly higher probability scores than top methods, detecting these fine-grained edits with high confidence.
  • Figure 2: Proposed Method: The input video is processed using a face detection algorithm to extract equally spaced face-centered frames. These frames are divided into $N$ tubular patches, which are fed into a novel encoder, built by fusing latent representations from pretrained pretext tasks, to generate the latent vector $\mathbf{X}_E$. The encoded latent vector $\mathbf{X}_E$ is then passed through a classification head to classify the video as real or fake.
  • Figure 3: Pretext Tasks Training: Video-derived tubular tokens are first processed to form learnable embeddings. Some tokens are randomly masked, and the visible tokens are refined using an encoder. For training, the encoded latent representation is appended with placeholder tokens for masked positions (shown in gray) and passed through the decoder to reconstruct task-specific targets. For masking, the target is the same as the input face-centered frames; for AU detection, the target to be reconstructed is $16$ action unit maps per frame.
  • Figure 4: Visual detection comparison for locally manipulated videos: A real video is subjected to three types of localized manipulations, creating fake videos that are visually indistinguishable from the original. Individual frames for real and fake videos are shown for illustration. For each real sample, we tabulate the fake probability scores averaged over the three edited versions, comparing our method with state-of-the-art approaches. Despite the subtlety of the edits, our method identifies the fake videos with high probability, unlike the latest detection methods, demonstrating superior sensitivity to localized manipulations that are undetectable even to the naked eye.
  • Figure 5: Local Attention Visualization: Overlay of attention maps (bottom row) from the final cross-attention block in AUGFE on facial video frames (top row). The maps highlight key action units, demonstrating our model's ability to consistently capture critical facial features across diverse expressions.
  • ...and 3 more figures
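
The tubular patching and random masking described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the clip size, tube size (`t`, `p`), mask ratio, and the helper names `to_tubes` and `random_mask` are all hypothetical.

```python
import numpy as np

def to_tubes(video, t, p):
    """Split a (T, H, W, C) clip into non-overlapping t x p x p tubes,
    each flattened into one token vector."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "clip must tile evenly"
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # (T/t, H/p, W/p, t, p, p, C)
    return x.reshape(-1, t * p * p * C)        # (N, token_dim)

def random_mask(n_tokens, mask_ratio, rng):
    """Return sorted index arrays of visible and masked tokens."""
    n_masked = int(round(n_tokens * mask_ratio))
    perm = rng.permutation(n_tokens)
    return np.sort(perm[n_masked:]), np.sort(perm[:n_masked])

rng = np.random.default_rng(0)

# Hypothetical toy clip: 4 frames of 32x32 RGB (real inputs would be larger).
clip = rng.random((4, 32, 32, 3)).astype(np.float32)

tokens = to_tubes(clip, t=2, p=16)             # N = 2 * 2 * 2 = 8 tubes
visible_idx, masked_idx = random_mask(len(tokens), mask_ratio=0.75, rng=rng)

# Only visible tokens would go to the encoder; masked positions are filled
# with placeholder tokens before the decoder reconstructs the target.
visible_tokens = tokens[visible_idx]
print(tokens.shape, visible_tokens.shape)
```

The same visible/masked split could serve both pretext tasks, with only the reconstruction target changing (input frames versus action unit maps); whether the paper shares masks across tasks is not stated here.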