FaceCat: Enhancing Face Recognition Security with a Unified Diffusion Model

Jiawei Chen, Xiao Yang, Yinpeng Dong, Hang Su, Zhaoxia Yin

TL;DR

FaceCat addresses the practical need to unify face anti-spoofing (FAS) and adversarial detection (FAD) by leveraging a face diffusion model as a rich feature initializer. It introduces a hierarchical fusion mechanism to aggregate multi-level diffusion features and a text-guided multi-modal alignment to enrich semantic representation, complemented by a triplet-margin objective to separate real from fake/adversarial samples. A new FaceCatData dataset with 28 attack types supports robust evaluation of the unified framework. Experimental results demonstrate improved accuracy and robustness under common input transformations, indicating strong potential for real-world deployment of unified facial security systems.

Abstract

Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems. However, deploying them as separate systems suffers from limited practicality, complex deployment, and additional computational overhead, so it is necessary to implement both detection techniques within a unified framework. This paper aims to achieve this goal by overcoming two primary obstacles: 1) suboptimal face feature representation and 2) the scarcity of training data. To address the limited performance caused by existing feature representations, and motivated by the rich structural and detailed features of face diffusion models, we propose FaceCat, the first approach leveraging a diffusion model to simultaneously enhance the performance of FAS and FAD. Specifically, FaceCat uses a carefully designed hierarchical fusion mechanism to capture the rich face semantic features of the diffusion model. These features then serve as a robust foundation for a lightweight head, designed to execute FAS and FAD simultaneously. Because relying solely on single-modality image data limits feature representation, we further propose a novel text-guided multi-modal alignment strategy that utilizes text prompts to enrich feature representation, thereby enhancing performance. To combat data scarcity, we build a comprehensive dataset covering 28 attack types, offering greater potential for a unified framework in facial security. Extensive experiments validate the effectiveness of FaceCat, which generalizes significantly better and obtains excellent robustness against common input transformations.

Paper Structure (15 sections, 11 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Comparison of our method and face generative models. (a) Face generation-only models. (b) Our FaceCat exploits the abundant features inherent in face generative models to serve face anti-spoofing and adversarial detection simultaneously.
  • Figure 2: An overview of our proposed FaceCat framework. FaceCat includes a generative model $\epsilon(\bm{x}_t,t)$ to encode the noisy image $\bm{x}_t$ into face features, a lightweight head $\mathcal{H}^f$ to extract image embeddings $e$ from these face features, and a text encoder to obtain text embeddings $\{\bm{\omega}_i\}_{i=1}^K$ from text prompts. From the image embeddings and text embeddings, the multi-modal alignment strategy calculates a text-image similarity score that is treated as the logit. A triplet-based margin is utilized to facilitate the learning of features.
  • Figure 3: Multi-block feature representations are derived from the face diffusion model across different channels.
  • Figure 4: Adversarial examples generated by different attacks.
  • Figure 5: Area under the curve (AUC, %) of FaceCat and the baseline under common input transformations.
  • ...and 1 more figure
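
The Figure 2 caption describes two computations: a text-image similarity score that is treated as the logit, and a triplet-based margin. The sketch below illustrates both in plain Python. It is not the authors' implementation: the use of cosine similarity, the `temperature` scaling, the Euclidean distance in the triplet term, and all function names are assumptions made for illustration, since the summary above does not specify them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing for cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def similarity_logits(image_emb, text_embs, temperature=0.07):
    """Return one logit per text prompt for a single image embedding e.

    Each logit is the cosine similarity between e and a text embedding
    omega_i, divided by a temperature (an assumed, CLIP-style choice).
    """
    e = l2_normalize(image_emb)
    logits = []
    for omega in text_embs:
        w = l2_normalize(omega)
        logits.append(sum(a * b for a, b in zip(e, w)) / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over the K prompt logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def triplet_margin(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance (assumed form).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; K = 2 prompts, e.g. "real" and "fake".
    e = [1.0, 0.0]
    omegas = [[1.0, 0.0], [0.0, 1.0]]
    logits = similarity_logits(e, omegas, temperature=1.0)
    print(logits, softmax(logits))
    print(triplet_margin([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], margin=1.0))
```

In this toy setup the image embedding matches the first prompt exactly, so that prompt receives the larger logit. The triplet call shows the penalized case, where the positive sample lies farther from the anchor than the negative does.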