FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Gaojian Wang; Feng Lin; Tong Wu; Zhenguang Liu; Zhongjie Ba; Kui Ren

TL;DR

FSFM tackles cross-dataset face security by learning a universal real-face representation from unlabeled data. It fuses masked image modeling and instance discrimination under a novel CRFR-P masking strategy, coupled with local-to-global self-distillation, to capture both fine-grained facial textures and holistic semantics. Pretraining on real faces with FSFM yields a ViT backbone that generalizes better than supervised or standard SSL approaches across deepfake detection, face anti-spoofing, and diffusion forgery detection, often surpassing task-specific SOTA. The work demonstrates that a task-agnostic real-face representation can detect both digital and physical manipulations, with practical implications for robust, scalable face security systems.

Abstract

This work asks: given abundant unlabeled real faces, how can we learn a robust and transferable facial representation that improves the generalization of various face security tasks? We make the first attempt and propose FSFM, a self-supervised pretraining framework that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives (intra-region consistency, inter-region coherency, and local-to-global correspondence), collectively termed 3C, enable the model to encode both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining and than visual and facial self-supervised learning methods, and even outperforms task-specialized SOTA methods.
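
To make the region-aware masking idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not spell out the CRFR-P procedure, so this sketch assumes it fully covers one randomly chosen facial region and then masks the remaining regions at an equal proportion until an overall mask ratio is reached. The function name `crfr_p_mask`, the `region_ids` input (one facial-region label per ViT patch, e.g. from a face parser), and the default ratio of 0.75 are all hypothetical.

```python
import numpy as np


def crfr_p_mask(region_ids, mask_ratio=0.75, rng=None):
    """Illustrative region-aware patch mask (True = masked).

    region_ids: 1-D integer array with one facial-region label per patch.
    mask_ratio: fraction of all patches to mask, in [0, 1] (assumed value).
    """
    rng = np.random.default_rng() if rng is None else rng
    region_ids = np.asarray(region_ids)
    n = region_ids.size
    target = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)

    # Step 1 (assumption): cover one randomly chosen facial region entirely.
    regions = np.unique(region_ids)
    covered = rng.choice(regions)
    covered_idx = np.flatnonzero(region_ids == covered)
    if covered_idx.size > target:
        # The region exceeds the budget: mask only a random subset of it.
        covered_idx = rng.choice(covered_idx, size=target, replace=False)
    mask[covered_idx] = True

    # Step 2 (assumption): mask every other region at the same proportion.
    budget = target - int(mask.sum())
    other_total = n - int(np.sum(region_ids == covered))
    if budget > 0 and other_total > 0:
        frac = budget / other_total
        for r in regions:
            if r == covered:
                continue
            idx = np.flatnonzero(region_ids == r)
            k = int(np.floor(frac * idx.size))
            if k > 0:
                mask[rng.choice(idx, size=k, replace=False)] = True

    # Step 3: top up any rounding shortfall with random unmasked patches.
    shortfall = target - int(mask.sum())
    if shortfall > 0:
        free = np.flatnonzero(~mask)
        mask[rng.choice(free, size=shortfall, replace=False)] = True
    return mask


# Usage: a 14x14 patch grid (196 patches) with four toy regions.
toy_regions = np.repeat(np.arange(4), [20, 30, 50, 96])
toy_mask = crfr_p_mask(toy_regions, mask_ratio=0.75, rng=np.random.default_rng(0))
```

Under these assumptions, the fully covered region has to be inferred from the other regions (inter-region coherency), while the partially masked regions can be completed from their own visible patches (intra-region consistency).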
