FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Gaojian Wang, Feng Lin, Tong Wu, Zhenguang Liu, Zhongjie Ba, Kui Ren

TL;DR

FSFM tackles cross-dataset face security by learning a universal real-face representation from unlabeled data. It fuses masked image modeling and instance discrimination under a novel CRFR-P masking strategy, coupled with local-to-global self-distillation, to capture both fine-grained facial textures and holistic semantics. Pretraining on real faces with FSFM yields a ViT backbone that generalizes better than supervised or standard SSL approaches across deepfake detection, face anti-spoofing, and diffusion forgery tasks, often surpassing task-specific SOTA. The work demonstrates that task-agnostic, real-face representations can robustly detect both digital and physical manipulations, with practical implications for robust, scalable face security systems.

Abstract

This work asks: given abundant unlabeled real faces, how can we learn a robust and transferable facial representation that improves the generalization of various face security tasks? We make the first attempt and propose FSFM, a self-supervised pretraining framework that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives (consistency, coherency, and correspondence, namely 3C) enable the model to encode both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining and existing visual and facial self-supervised learning methods, and even outperforms task-specialized SOTA methods.
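To make the three objectives concrete, the following is a minimal NumPy sketch of how two reconstruction terms and one similarity term might be combined. The loss names follow the symbols in the Figure 1 caption; everything else is an assumption for illustration: the use of per-patch mean squared error, negative cosine similarity for the similarity term, and the weights `w_fr` and `w_sim` are hypothetical and are not taken from the paper.

```python
import numpy as np

def masked_mse(pred, target, patch_mask):
    """Mean squared error averaged over the selected patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # shape (N,)
    n_selected = max(int(patch_mask.sum()), 1)         # avoid division by zero
    return float((per_patch * patch_mask).sum() / n_selected)

def neg_cosine(z_online, z_target, eps=1e-8):
    """Negative cosine similarity (assumed form of the similarity term)."""
    o = z_online / (np.linalg.norm(z_online, axis=-1, keepdims=True) + eps)
    t = z_target / (np.linalg.norm(z_target, axis=-1, keepdims=True) + eps)
    return float(-(o * t).sum(axis=-1).mean())

def three_term_loss(pred, target, mask, fr_mask, z_online, z_target,
                    w_fr=1.0, w_sim=1.0):
    """Hypothetical combination of the three terms.

    pred, target : (N, D) reconstructed and original patch pixels
    mask         : (N,) bool, all masked patches        -> L_rec^m
    fr_mask      : (N,) bool, the fully covered region  -> L_rec^fr
    z_online, z_target : (B, C) embeddings from the online and
                         target branches                -> L_sim
    """
    l_rec_m = masked_mse(pred, target, mask)
    l_rec_fr = masked_mse(pred, target, fr_mask)
    l_sim = neg_cosine(z_online, z_target)
    return l_rec_m + w_fr * l_rec_fr + w_sim * l_sim

# Toy usage with random data (shapes are illustrative only).
rng = np.random.default_rng(0)
target = rng.normal(size=(196, 768))
pred = target + 0.1 * rng.normal(size=(196, 768))
mask = rng.random(196) < 0.75
fr_mask = mask & (np.arange(196) < 40)
z = rng.normal(size=(4, 256))
loss = three_term_loss(pred, target, mask, fr_mask, z, z + 0.01)
```

In a real training loop the target-branch embedding would typically be treated as a constant (no gradient), which is the usual self-distillation setup; this sketch only computes the scalar value.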

Paper Structure

This paper contains 35 sections, 5 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the FSFM self-supervised pretraining framework for learning foundational representations of real faces (3C). Guided by the CRFR-P masking strategy, the masked image modeling (MIM) network captures intra-region Consistency with $\mathcal{L}_\mathit{rec}^\mathit{m}$ and enforces inter-region Coherency via $\mathcal{L}_\mathit{rec}^\mathit{fr}$, while the instance discrimination (ID) network collaborates to promote local-to-global Correspondence through $\mathcal{L}_\mathit{sim}$. After pretraining, the online encoder $E_\mathit{o}$ (a vanilla ViT) is applied to boost downstream face security tasks.
  • Figure 2: Comparison of masking strategies for face images (75% masking ratio). (a) Random masking. (b) Fasking-I, adapted from cai2023marlin, which prioritizes masking regions $\notin \{\text{bg}, \text{skin}\}$. (c) Our FRP: Proportional masking within each Facial Region $\in \{\mathit{FR}\}$. (d) Our CRFR-R: Covering a Random Facial Region $\in \{\mathit{fr}\}$ and then Random masking of the other patches. (e) Our CRFR-P: Covering a Random Facial Region $\in \{\mathit{fr}\}$ and then Proportional masking of the other regions $\in \{\mathit{FR}-\mathit{fr}\}$. All masks are binary (black is used only to highlight $\mathit{fr}$).
  • Figure 3: Comparison of different target views. (a) Visible patches from a different mask. (b) Masked patches from the same mask. (c) Full patches without masking.
  • Figure 4: CAM visualization. (a) Deepfake detection (DfD) on various manipulations from FF++ rossler2019faceforensics++. (b) Face anti-spoofing (FAS) on the MCIO protocol. FSFM highlights forgery artifacts and spoofing clues. Images are from the test set.
  • Figure 5: Additional visualizations of different facial masking strategies. (a) Random masking he2022masked. (b) Fasking-I adapted from cai2023marlin. (c) FRP: Facial Region Proportional masking. (d) CRFR-R: Covering a Random Facial Region followed by Random masking. (e) CRFR-P: Covering a Random Facial Region followed by Proportional masking.
  • ...and 6 more figures
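
The CRFR-P strategy described in the Figure 2 caption (cover one random facial region, then mask the other regions proportionally) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a hypothetical per-patch label array `region_ids` produced by some face parser, and two details are guesses: when the covered region alone exceeds the mask budget, a random subset of it is masked, and any rounding shortfall is filled with random unmasked patches.

```python
import numpy as np

def crfr_p_mask(region_ids, mask_ratio=0.75, facial_regions=None, rng=None):
    """Return (mask, fr): a boolean patch mask and the covered region id.

    region_ids     : (N,) int array, one region label per patch (hypothetical input)
    facial_regions : labels eligible to be the fully covered region;
                     defaults to every label present
    """
    rng = np.random.default_rng() if rng is None else rng
    region_ids = np.asarray(region_ids)
    n = region_ids.size
    n_mask = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)

    present = np.unique(region_ids)
    candidates = present if facial_regions is None else np.intersect1d(present, facial_regions)
    fr = rng.choice(candidates)
    fr_idx = np.flatnonzero(region_ids == fr)

    # Assumption: if the covered region alone exceeds the budget,
    # mask only a random subset of it.
    if fr_idx.size >= n_mask:
        mask[rng.choice(fr_idx, size=n_mask, replace=False)] = True
        return mask, fr

    # Step 1: cover the chosen facial region completely.
    mask[fr_idx] = True

    # Step 2: mask the same fraction of patches in every other region.
    remaining = n_mask - fr_idx.size
    frac = remaining / (n - fr_idx.size)
    for r in present:
        if r == fr:
            continue
        idx = np.flatnonzero(region_ids == r)
        k = int(np.floor(frac * idx.size))
        mask[rng.choice(idx, size=k, replace=False)] = True

    # Assumption: fill the rounding shortfall with random unmasked patches.
    shortfall = n_mask - int(mask.sum())
    if shortfall > 0:
        free = np.flatnonzero(~mask)
        mask[rng.choice(free, size=shortfall, replace=False)] = True
    return mask, fr

# Toy usage: 196 patches (14 x 14 grid), label 0 plays the role of background.
region_ids = np.repeat([0, 1, 2, 3, 4], [96, 40, 30, 20, 10])
mask, fr = crfr_p_mask(region_ids, facial_regions=[1, 2, 3, 4],
                       rng=np.random.default_rng(0))
```

Masking each remaining region at the same rate keeps some visible patches in every region, which is what distinguishes the proportional variant from the random variant (CRFR-R) in the caption.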