Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection

Yaning Zhang, Qiufu Li, Zitong Yu, Linlin Shen

TL;DR

This work tackles face forgery detection by addressing two problems: the scarcity of soft-label supervision and the tendency of attention maps in transformer-based models to collapse as the network gets deeper. The authors introduce DTN, a distilled transformer network that combines soft tag generation (STG) via deepfake self-distillation (DSD), a Mixture of Experts (MoE) for diverse local embeddings, and a Locally-Enhanced Vision Transformer (LEVT) with Multi-Attention Scaling (MAS) to enrich global representations. Key contributions include STG/DSD for soft supervision, a plug-and-play MAS module to prevent attention collapse, and an MoE+LEVT architecture that yields robust, generalizable forgery representations. Results on five deepfake datasets show state-of-the-art performance and strong cross-dataset generalization; ablations validate each component’s contribution, and visualizations support the claim of improved attention diversity.

Abstract

Face forgery detection (FFD) aims to determine the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they tend to latch onto local forgery patterns produced by particular manipulation methods. Transformer-based detectors are better at modeling global dependencies, but they are weak at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture both local and global manipulation traces, but they tend to suffer from attention collapse as the transformer blocks go deeper. In addition, soft labels are rarely available. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and to learn general, common representations across different forged faces. Specifically, we design a mixture of experts (MoE) module to mine various robust forgery embeddings. Moreover, a locally-enhanced vision transformer (LEVT) module is proposed to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse, which can be plugged into any transformer-based model with only a slight increase in computational cost. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft label information. Extensive experiments show that the proposed method surpasses the state of the art on five deepfake datasets.
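The abstract does not spell out how the soft labels from DSD enter the training objective. As a rough illustration only, the sketch below uses the standard knowledge-distillation loss (hard-label cross-entropy blended with a temperature-softened KL term against the teacher). This is a swapped-in, generic technique and not necessarily the authors' formulation; the names `distillation_loss`, `alpha`, and `temperature` are assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot (hard) label."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Generic distillation loss (assumed form, not taken from the paper).

    hard term: cross-entropy of the student against the real/fake label.
    soft term: KL from the teacher's softened prediction to the student's.
    alpha:     hypothetical weight on the soft term.
    """
    hard_term = cross_entropy(softmax(student_logits), hard_label)
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = kl_divergence(soft_teacher, soft_student)
    return (1.0 - alpha) * hard_term + alpha * soft_term
```

With two classes (real, fake), a call such as `distillation_loss([1.2, -0.3], [0.9, 0.1], hard_label=0)` penalizes the student both for missing the hard label and for diverging from the teacher's softened confidence, which is the extra signal a binary label alone cannot provide.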

Paper Structure (19 sections, 21 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: (a) Schematic illustration of face image manipulation labels (real or fake). (b) The heatmap visualization of various deepfake detectors. (c) The visualization of attention maps across various heads in different transformer blocks.
  • Figure 2: (a) Traditional hybrid transformer, i.e. Convolutional vision transformer (CViT). (b) The proposed distilled transformer network (DTN). The heatmap of CViT and DTN for detecting faces in shifted domain attacks (saturation and block-wise are selected from seven types of image corruptions in Jiang2020DeeperForensics). CViT struggles to mine the consistent forgery artifacts while DTN is capable of capturing the consistent and comprehensive forgery patterns.
  • Figure 3: An overview of the proposed DTN framework. We utilize the DSD scheme to capture general forgery traces: we first pre-train a DTN model on labeled training images as the teacher model to guide the student model's learning, and each generation of the student then serves as the teacher for the next one until no improvement is observed. The DTN model encodes high-level semantic embeddings from input facial images through a VGG backbone, which are then fed into the MoE module to extract various robust embeddings. These are passed to the LEVT module to model the locally enhanced global relations among image patches, where MAS further mines rich facial forgery patterns by flexibly choosing attention maps. Finally, the classifier yields predictions.
  • Figure 4: The workflow of the scaling multi-head self-attention (S-MHSA) module.
  • Figure 5: Robustness to unseen image deformations. DSD denotes deepfake self-distillation.
  • ...and 5 more figures
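
The generational schedule described in the Figure 3 caption (train a teacher on labels, let each student become the next teacher, stop when nothing improves) can be sketched as a short loop. This is a minimal sketch under stated assumptions: `train_fn`, `eval_fn`, and `max_generations` are hypothetical placeholders, and the stopping rule "score does not exceed the best so far" is one reading of "until no improvements are observed".

```python
def self_distill(train_fn, eval_fn, max_generations=10):
    """Generational self-distillation loop (illustrative sketch).

    train_fn(teacher) -> model
        Hypothetical trainer. teacher is None for the first model,
        which is trained on hard labels only.
    eval_fn(model) -> float
        Hypothetical validation score; higher is better.
    Returns the best model and the list of accepted scores.
    """
    # Generation 0: the label-only model becomes the first teacher.
    teacher = train_fn(None)
    best_score = eval_fn(teacher)
    history = [best_score]

    for _ in range(max_generations):
        # Train a new student guided by the current teacher.
        student = train_fn(teacher)
        score = eval_fn(student)
        if score <= best_score:
            # No improvement: keep the current teacher and stop.
            break
        # The improved student is promoted to teacher.
        teacher, best_score = student, score
        history.append(score)

    return teacher, history
```

The `max_generations` cap is a safety bound added here so the loop always terminates; the caption itself only states the no-improvement criterion.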