Face Forgery Detection with Elaborate Backbone

Zonghui Guo, Yingjie Liu, Jie Zhang, Haiyong Zheng, Shiguang Shan

TL;DR

This work builds a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues through a competitive learning mechanism, and devises a threshold optimization mechanism that uses prediction confidence to improve inference reliability.

Abstract

Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Because different face synthesis algorithms produce diverse forgery patterns, FFD models often overfit the specific patterns in their training datasets, resulting in poor generalization to unseen forgeries. This severe challenge requires FFD models to possess strong capabilities in representing complex facial features and extracting subtle forgery cues. Although previous FFD models directly employ existing backbones to represent and extract facial forgery cues, the critical role of backbones is often overlooked, particularly as their knowledge and capabilities are insufficient to address FFD challenges, inevitably limiting generalization. Therefore, it is essential to integrate the backbone pre-training configurations and seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. Specifically, we analyze the crucial contributions of backbones with different configurations in the FFD task and propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone, equipping it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues within a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and extra face-related tasks, i.e., presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.

Paper Structure (22 sections, 5 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: Quantitative comparison of FFD methods with various backbones. "Unpre-trained" means networks with randomly initialized parameters, while "SL-1K" denotes those pre-trained with supervised learning on ImageNet-1K. The average AUC across three cross-datasets (Celeb-DF, DFDC, and FFIW) better reflects the generalization of FFD models.
  • Figure 2: Representative forged face examples with forgery cues primarily in facial components such as eyebrows, nose, eyes, and lips.
  • Figure 3: Overview of our research to revisit the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. This includes network architectures, datasets, and learning approaches covering supervised learning (SL) and self-supervised learning (SSL), the latter encompassing contrastive learning (CL) and masked image modeling (MIM) pretext tasks, as well as our competitive backbone fine-tuning framework and threshold optimization mechanism.
  • Figure 4: The FFD workflow evolves from the traditional pipeline to our revitalized FFD pipeline. Existing FFD methods primarily rely on backbones pre-trained with supervised learning on ImageNet, apply various techniques during fine-tuning, and use empirical classification thresholds during inference. In contrast, our FFD pipeline offers a more proficient, promising, and reliable solution by incorporating self-supervised learning on real faces, a competitive backbone framework, and an uncertainty-based threshold optimization mechanism across the three stages.
  • Figure 5: Our uncertainty-based fusion module calculates uncertainty using features from the main (M) and auxiliary (A) branches, then inputs these into Softmax to obtain weights for fusing the features as output.
  • ...and 4 more figures
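
The Figure 5 caption describes fusing main (M) and auxiliary (A) branch features with Softmax weights derived from uncertainty. The following is a minimal sketch of that general idea, not the authors' implementation. It makes several assumptions the caption does not state: uncertainty is taken to be the entropy of each branch's class probabilities, the less uncertain branch is given the larger weight (Softmax over negated uncertainties), and the names `fuse`, `entropy`, `logits_main`, and `logits_aux` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats), used here as an assumed uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse(feat_main, feat_aux, logits_main, logits_aux):
    """Fuse two feature vectors with uncertainty-derived weights.

    Assumption: each branch also yields class logits, and the entropy of the
    resulting probabilities serves as that branch's uncertainty.
    """
    u_main = entropy(softmax(logits_main))
    u_aux = entropy(softmax(logits_aux))
    # Assumption: negate so that the more confident branch gets more weight.
    w_main, w_aux = softmax([-u_main, -u_aux])
    fused = [w_main * m + w_aux * a for m, a in zip(feat_main, feat_aux)]
    return fused, (w_main, w_aux)


if __name__ == "__main__":
    # The main branch is confident (peaked logits); the auxiliary one is not.
    fused, weights = fuse([1.0, 2.0], [3.0, 4.0], [4.0, -4.0], [0.1, -0.1])
    print("weights (main, aux):", weights)
    print("fused features:", fused)
```

In this sketch the weights always sum to one, so the fused vector is a convex combination of the two branch features; when both branches are equally uncertain, the result is their plain average.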