UniForensics: Face Forgery Detection via General Facial Representation

Ziyuan Fang, Hanqing Zhao, Tianyi Wei, Wenbo Zhou, Ming Wan, Zhanyi Wang, Weiming Zhang, Nenghai Yu

TL;DR

The paper addresses the challenge of generalizing face forgery detection to unseen manipulation methods by leveraging high-level semantic facial representations and temporal cues. It introduces UniForensics, a transformer-based video detector initialized with FaRL's meta-functional face encoder, and DVSB to synthesize temporally diverse fake samples from real videos. A two-stage training pipeline—self-supervised forgery-process contrastive pretraining followed by supervised finetuning—yields strong cross-dataset performance, with Celeb-DFv2 and DFDC AUCs reaching $95.3\%$ and $77.2\%$, respectively, and robust performance under common corruptions. The approach demonstrates that combining semantic-rich facial features with spatio-temporal modeling and carefully designed data synthesis substantially enhances generalization and practicality for real-world deepfake detection.

Abstract

Previous deepfake detection methods mostly depend on low-level textural features that are vulnerable to perturbations, and they fall short of detecting unseen forgery methods. In contrast, high-level semantic features are less susceptible to perturbations and are not limited to forgery-specific artifacts, and thus generalize better. Motivated by this, we propose a detection method that utilizes high-level semantic features of faces to identify inconsistencies in the temporal domain. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video classification network, initialized with a meta-functional face encoder for enriched facial representation. In this way, we can take advantage of both the powerful spatio-temporal model and the high-level semantic information of faces. Furthermore, to leverage easily accessible real face data and guide the model to focus on spatio-temporal features, we design a Dynamic Video Self-Blending (DVSB) method to efficiently generate training samples with diverse spatio-temporal forgery traces using real facial videos. Based on this, we advance our framework with a two-stage training approach. The first stage employs a novel self-supervised contrastive learning scheme, in which we encourage the network to focus on forgery traces by impelling videos generated by the same forgery process to have similar representations. Building on the representation learned in the first stage, the second stage involves fine-tuning on a face forgery detection dataset to build a deepfake detector. Extensive experiments validate that UniForensics outperforms existing face forgery detection methods in generalization ability and robustness. In particular, our method achieves 95.3\% and 77.2\% cross-dataset AUC on the challenging Celeb-DFv2 and DFDC, respectively.
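
The first-stage idea, pulling together clips produced by the same forgery process, can be illustrated with a small sketch. This is not the paper's objective: it is a generic supervised-contrastive-style loss written under stated assumptions. The function names, the cosine similarity, the temperature value, and the use of integer `process_ids` to mark which synthetic forgery process produced each clip are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def forgery_process_contrastive_loss(embeddings, process_ids, temperature=0.1):
    """Illustrative loss (assumption, not the paper's exact formulation).

    For each anchor clip, clips sharing its process id are positives and
    every other clip in the batch is a negative.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and process_ids[j] == process_ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Temperature-scaled similarity to every other clip in the batch.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denominator
                      for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two embedding directions, four clips.
clips = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_aligned = forgery_process_contrastive_loss(clips, [0, 0, 1, 1])
loss_mismatched = forgery_process_contrastive_loss(clips, [0, 1, 0, 1])
```

In the toy batch, the loss is lower when clips from the same process already share an embedding direction (`loss_aligned`) than when they do not (`loss_mismatched`), which is the behaviour the first stage is described as encouraging.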

Paper Structure (27 sections, 11 equations, 5 figures, 9 tables)


Figures (5)

  • Figure 1: Overview of our proposed method. We use a pretrained meta-functional face encoder to initialize a transformer-based video classification network as the detection backbone and advance it with our two-stage training strategy. The first stage employs self-supervised contrastive learning on fake samples generated from real face data. The second stage involves fine-tuning on a deepfake detection dataset.
  • Figure 2: Dynamic Video Self-Blending: $K$ is a group of parameters that controls the realization of the transforms applied to the target image and the blending policies. We first generate a random sequence of $K$ and then use a temporal filter to make the transformation $K$ of each frame smooth in time. Consequently, the synthesized fake video has non-uniform temporal artifacts.
  • Figure 3: Robustness evaluation. We report the AUC (%) scores of our method under five severity levels for each of seven types of corruption. "Average" denotes the mean across all corruptions at each severity level.
  • Figure 4: Examples of different temporal manipulations on the same video clip. Real: the real clip without manipulation. Static: the same transformation is applied to every frame in the clip. Independent: an independent random transformation is applied to each frame. Dynamic: the proposed Dynamic Video Self-Blending (DVSB).
  • Figure 5: Attention maps of global blocks. We choose two fake video clips and show the attention maps of global blocks 1 to 6.
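
The temporal smoothing step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $K$ is reduced to a single scalar per frame (for example a blend ratio), and the temporal filter is taken to be a simple moving average. The caption specifies neither choice, and the names `temporal_filter` and `sample_dynamic_schedule` are hypothetical.

```python
import random


def temporal_filter(values, window=5):
    """Moving-average filter; the window is truncated at the clip boundaries.

    Assumption: the paper only says "a temporal filter", so a moving
    average stands in for it here.
    """
    half = window // 2
    smoothed = []
    for t in range(len(values)):
        neighbours = values[max(0, t - half):t + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


def sample_dynamic_schedule(num_frames, low, high, window=5, seed=None):
    """Draw one random parameter per frame, then smooth the sequence in time."""
    rng = random.Random(seed)
    raw = [rng.uniform(low, high) for _ in range(num_frames)]
    return raw, temporal_filter(raw, window)


# Hypothetical usage: a per-frame blend ratio for a 16-frame clip.
raw_ratio, dynamic_ratio = sample_dynamic_schedule(16, 0.25, 1.0, window=5, seed=0)

# The three settings compared in Figure 4, for one scalar parameter:
static_ratio = [raw_ratio[0]] * 16   # Static: same value in every frame
independent_ratio = raw_ratio        # Independent: unfiltered per-frame draws
```

Under these assumptions, the dynamic schedule varies from frame to frame (unlike the static one) but changes gradually (unlike the independent one), which is how the caption motivates non-uniform temporal artifacts.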