Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces

Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

TL;DR

Delocate tackles the challenge of detecting and localizing Deepfake videos with tampering traces located at random facial regions across unseen domains. It proposes a two-stage framework: a Recovering for Consistency Learning stage that pretrains a masked autoencoder on real faces with ROI-based masking to learn facial-part consistency, and a Localization for Discrepancy Learning stage that uses meta-learning and an encoder–decoder with a mapping module to detect and localize tampered regions by exploiting reconstruction discrepancies. The method jointly optimizes classification and localization losses under a meta-learning regime to enhance cross-domain generalization, achieving superior cross-domain detection and localization on multiple benchmarks while maintaining strong intra-domain performance. Overall, Delocate provides interpretable localization cues and robust detection for unknown-domain Deepfakes, advancing practical Deepfake forensic capabilities.
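To make the idea of jointly optimizing classification and localization concrete, the following is a minimal sketch, not the authors' implementation. It assumes both terms are binary cross-entropy losses combined by a weighted sum. The function names and the `loc_weight` parameter are hypothetical, and the meta-learning update is omitted entirely.

```python
import numpy as np


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))


def joint_loss(cls_prob, cls_label, mask_prob, mask_gt, loc_weight=1.0):
    """Weighted sum of a video-level real/fake loss and a pixel-level mask loss.

    ASSUMPTION: the weighted-sum form and `loc_weight` are illustrative only;
    the summary above does not state how the two losses are combined.
    """
    cls_term = binary_cross_entropy(cls_prob, cls_label)   # real vs. fake
    loc_term = binary_cross_entropy(mask_prob, mask_gt)    # tampered-region mask
    return cls_term + loc_weight * loc_term


# Usage: one fake sample with a 2x2 ground-truth forgery mask.
loss = joint_loss(
    cls_prob=[0.9], cls_label=[1.0],
    mask_prob=[[0.8, 0.1], [0.7, 0.2]],
    mask_gt=[[1.0, 0.0], [1.0, 0.0]],
    loc_weight=0.5,
)
```

Raising `loc_weight` in this sketch puts more emphasis on the localization term relative to the classification term.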

Abstract

Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames. Consequently, existing Deepfake detection methods struggle to detect unknown domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel Deepfake detection model that can both recognize and localize unknown domain Deepfake videos. Our method consists of two stages named recovering and localization. In the recovering stage, the model randomly masks regions of interest (ROIs) and reconstructs real faces without tampering traces, leading to a relatively good recovery effect for real faces and a poor recovery effect for fake faces. In the localization stage, the output of the recovery phase and the forgery ground truth mask serve as supervision to guide the forgery localization process. This process strategically emphasizes the recovery phase of fake faces with poor recovery, facilitating the localization of tampered regions. Our extensive experiments on four widely used benchmark datasets demonstrate that Delocate not only excels in localizing tampered areas but also enhances cross-domain detection performance.
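The recovering stage described above can be illustrated with a small sketch of ROI masking and a per-pixel recovery error map. This is an illustration under stated assumptions, not the paper's code: the ROI boxes, the zero-fill masking, the absolute-difference error, and the function names are all hypothetical, and the reconstruction model itself is left out.

```python
import numpy as np


def mask_random_rois(face, rois, num_masked, rng):
    """Hide `num_masked` randomly chosen ROI boxes in a face image.

    face: float array of shape (H, W) or (H, W, C).
    rois: dict mapping a region name to a (top, left, bottom, right) box.
          ASSUMPTION: the boxes are hypothetical; the abstract does not say
          how ROIs are obtained.
    Returns the masked image and a boolean (H, W) map of hidden pixels.
    """
    names = sorted(rois)
    chosen = rng.choice(len(names), size=num_masked, replace=False)
    masked = face.copy()
    hidden = np.zeros(face.shape[:2], dtype=bool)
    for idx in chosen:
        top, left, bottom, right = rois[names[idx]]
        masked[top:bottom, left:right] = 0.0   # zero-fill is an assumption
        hidden[top:bottom, left:right] = True
    return masked, hidden


def recovery_error_map(face, recovered):
    """Per-pixel absolute difference between a face and its reconstruction."""
    err = np.abs(face - recovered)
    return err.mean(axis=-1) if err.ndim == 3 else err


# Usage with a toy 8x8 "face" and three non-overlapping hypothetical ROIs.
rng = np.random.default_rng(0)
face = np.ones((8, 8))
rois = {"eyes": (0, 0, 2, 8), "nose": (2, 3, 5, 5), "mouth": (5, 2, 7, 6)}
masked, hidden = mask_random_rois(face, rois, num_masked=2, rng=rng)
error = recovery_error_map(face, masked)  # stand-in for a model's output
```

In this toy setup the error map is nonzero exactly where pixels were hidden. With a trained reconstruction model, the abstract's premise is that real faces would yield low error while fake faces would yield high error, which is the signal the localization stage builds on.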

Paper Structure

This paper contains 11 sections, 6 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Differences between our method and previous methods. Previous CLS-Rec methods mainly emphasize classification while overlooking localization. Previous CLS & Localize methods leverage real and fake labels for feature extraction, without first modeling real samples to extract robust features. Our method integrates both classification and localization, with a dedicated focus on real samples, enabling us to extract features for enhanced performance.
  • Figure 2: Pipeline of the proposed Delocate. In the Recovering stage, Delocate learns unspecific features through the designed masking strategy and recovery process. In the Localization stage, Delocate leverages the devised mapping module and encoder-decoder module to maximize the discrepancy between real videos and Deepfake videos and to locate the forgery areas.
  • Figure 3: The significance of the randomly-located traces. Different forgery patterns alter the face area with different shapes, so the tampered traces appear at random locations across frames and cannot be predicted from the current frame, resulting in strong unpredictability. (I) Face2Face in FF++. (II) FSGAN in DFDC. (III) DeepFakes in FF++. (IV) Deepfake in Celeb-DF.
  • Figure 4: Comparisons of predicted forgery regions on the CDF, DFo, and DFDC datasets when trained on the $4$ types of videos in FF++.