Distilled Transformers with Locally Enhanced Global Representations for Face Forgery Detection
Yaning Zhang, Qiufu Li, Zitong Yu, Linlin Shen
TL;DR
This work tackles face forgery detection by addressing two problems: the scarcity of soft supervision and the tendency of transformer-based models to suffer attention collapse as their blocks go deeper. The authors introduce DTN, a distilled transformer framework that combines soft tag generation (STG) via deepfake self-distillation (DSD), a Mixture of Experts (MoE) for diverse local embeddings, and a Locally-Enhanced Vision Transformer (LEVT) with Multi-Attention Scaling (MAS) to enrich global representations. Key contributions include STG/DSD for soft supervision, a plug-and-play MAS module to prevent attention collapse, and an MoE+LEVT architecture that yields robust, generalizable forgery representations. Empirical results on five deepfake datasets show state-of-the-art performance and strong cross-dataset generalization, with extensive ablations validating each component's impact and visualizations supporting improved attention diversity.
Abstract
Face forgery detection (FFD) aims to determine the authenticity of face images. Although current CNN-based works achieve outstanding performance in FFD, they are prone to capturing local forgery patterns specific to individual manipulation methods. Transformer-based detectors improve the modeling of global dependencies, but they are weak at exploring local forgery artifacts. Hybrid transformer-based networks are designed to capture both local and global manipulation traces, but they tend to suffer from attention collapse as the transformer blocks go deeper. Moreover, soft labels are rarely available. In this paper, we propose a distilled transformer network (DTN) to capture both rich local and global forgery traces and to learn general, common representations across different forged faces. Specifically, we design a mixture of experts (MoE) module to mine various robust forgery embeddings. We further propose a locally-enhanced vision transformer (LEVT) module to learn locally-enhanced global representations. We design a lightweight multi-attention scaling (MAS) module to avoid attention collapse; it can be plugged into any transformer-based model with only a slight increase in computational cost. In addition, we propose a deepfake self-distillation (DSD) scheme to provide the model with abundant soft label information. Extensive experiments show that the proposed method surpasses the state of the art on five deepfake datasets.
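To make the idea of soft-label supervision concrete, the following is a minimal sketch of a generic distillation objective, in which a teacher's temperature-softened predictions serve as soft labels for a student alongside the hard real/fake label. It is not the paper's DSD scheme, whose details are not given in the abstract; the function names, the temperature, and the weighting factor `alpha` are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a flatter output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: hard-label CE plus soft-label KL term.

    temperature and alpha are illustrative assumptions, not values from the paper.
    """
    # Soft labels: the teacher's temperature-softened class probabilities.
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_targets, student_soft) if t > 0)
    # Standard cross-entropy against the one-hot real/fake label.
    one_hot = [1.0 if i == hard_label else 0.0
               for i in range(len(student_logits))]
    ce = cross_entropy(one_hot, student_logits)
    # The T^2 factor is the usual rescaling of the soft term in distillation.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

In a self-distillation setting, the teacher and student would share the same architecture; when their logits agree, the KL term vanishes and only the hard-label term remains.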
