Unified Physical-Digital Face Attack Detection

Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei

TL;DR

This work addresses unified face attack detection by introducing UniAttackData, the first ID-consistent dataset that jointly covers physical and digital attacks, with $1{,}800$ subjects and $29{,}706$ videos. It proposes UniAttackDetection, a CLIP-based framework with three modules (Teacher-Student Prompt, Unified Knowledge Mining, and Sample-Level Prompt Interaction) that learns a compact yet complete feature space spanning live faces and both attack modalities. A unified feature-space objective (L_UFM) and multimodal prompt learning support cross-attack and cross-dataset generalization, and the method outperforms single-attack detectors on protocols with seen and unseen attack types. The approach performs strongly on UniAttackData and on the other datasets used for UAD evaluation (FF++, JFSFDB, OULU-NPU); ablations confirm the contribution of each component, and feature visualizations show clearer separation between live and attack clusters. Overall, the work points toward more efficient, scalable FR security by unifying PAD and DAD under a single vision-language framework.

Abstract

Face Recognition (FR) systems can suffer from physical (e.g., print photo) and digital (e.g., DeepFake) attacks. However, previous work rarely considers both at the same time, which implies deploying multiple models and thus a greater computational burden. The lack of an integrated model stems from two factors: (1) no existing dataset includes both physical and digital attacks with ID consistency, meaning that the same ID covers the real face and all attack types; (2) given the large intra-class variance between these two kinds of attack, it is difficult to learn a compact feature space that detects both simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset covers $1,800$ participants with 2 physical and 12 digital attack types, for a total of 29,706 videos. We then propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. Together, these three modules form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection.
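The ID-consistency property described in the abstract (the same ID covers the real face and all attack types) can be checked mechanically on a list of video annotations. The following is a minimal sketch, not the authors' tooling: the label names (`live`, `physical_1`, `digital_1`, ...) and the `(subject_id, label)` record layout are hypothetical placeholders, and only the counts of 2 physical and 12 digital attack types come from the text.

```python
from collections import defaultdict

# Hypothetical label names: the source gives only the counts
# (2 physical and 12 digital attack types), not the actual labels.
PHYSICAL = [f"physical_{i}" for i in range(1, 3)]
DIGITAL = [f"digital_{i}" for i in range(1, 13)]
REQUIRED = {"live", *PHYSICAL, *DIGITAL}


def find_inconsistent_ids(records):
    """Return {subject_id: missing_labels} for subjects lacking a required label.

    `records` is an iterable of (subject_id, label) pairs, one per video
    (an assumed annotation layout, for illustration only).
    """
    seen = defaultdict(set)
    for subject_id, label in records:
        seen[subject_id].add(label)
    return {
        sid: sorted(REQUIRED - labels)
        for sid, labels in seen.items()
        if not REQUIRED <= labels
    }


# Toy data: id_001 has every label, id_002 has only two of them.
records = [("id_001", label) for label in REQUIRED]
records += [("id_002", "live"), ("id_002", "physical_1")]

missing = find_inconsistent_ids(records)
print(sorted(missing))         # ['id_002']
print(len(missing["id_002"]))  # 13
```

A dataset is ID-consistent in this sense when the function returns an empty dictionary, that is, when no subject is missing the live class or any attack type.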

Paper Structure (27 sections, 5 equations, 8 figures, 6 tables)


Figures (8)

  • Figure 1: Paradigm Comparison. (a) Prior approaches necessitate the separate training and deployment of PAD and DAD models, demanding significant computational resources and inference time. (b) UAD dataset without ID consistency introduces the risk of the algorithm learning noise related to ID. (c) The base model encounters challenges in acquiring a compact feature space when confronted with the UAD dataset. (d) Our algorithm learns compact feature space and clear class boundaries.
  • Figure 2: UniAttackData examples of all attack types corresponding to the same face ID. From top to bottom, the subjects are African, Central Asian, and East Asian. The attack type of each sample is marked at the top.
  • Figure 3: Our proposed UniAttackDetection architecture. The TSP module extracts unified and specific knowledge by constructing multiple groups of teacher prompts and learnable student prompts. The UKM module oversees the learning process by employing the unified knowledge mining loss, thereby enabling the model to acquire comprehensive insights across the entire feature space. The SLPI module maps the student prompts to the visual embedding space, allowing multi-modal prompt learning by making the student prompt learn sample-level semantics while allowing visual feature extraction to be guided by text.
  • Figure 4: Ablation experiments on the selection of teacher prompts. The horizontal coordinates indicate which teacher prompts from Tab. \ref{Table:ablation-2} were selected. For example, T1$\sim$6 indicates the selection of the first six prompts from Tab. \ref{Table:ablation-2}.
  • Figure 5: Feature distribution comparison on UAD data protocol (OULU-FF) and UniAttackData using t-SNE. Different colors denote features from different classes.
  • ...and 3 more figures
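
The Figure 3 caption says the SLPI module maps the student prompts to the visual embedding space. One plausible reading, sketched below in plain Python, is a linear projection from the text-token dimension to the visual-token dimension, with the projected prompts prepended to the image patch tokens. This is an illustrative assumption, not the authors' implementation: the dimensions, the single linear layer, the random initialization, and all function names are hypothetical.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Create a randomly initialized linear layer as (weight, bias) lists."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project_prompts(prompt_tokens, weight, bias):
    """Map each prompt token from the text dimension to the visual dimension."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in prompt_tokens
    ]


def prepend_visual_prompts(visual_prompts, patch_tokens):
    """Place the projected prompts in front of the image patch tokens."""
    return visual_prompts + patch_tokens


# Toy sizes, chosen only for illustration.
D_TEXT, D_VIS, N_CTX, N_PATCH = 8, 12, 4, 16
rng = random.Random(1)
student_prompts = [[rng.gauss(0, 1) for _ in range(D_TEXT)] for _ in range(N_CTX)]
patch_tokens = [[rng.gauss(0, 1) for _ in range(D_VIS)] for _ in range(N_PATCH)]

weight, bias = make_linear(D_TEXT, D_VIS)
visual_prompts = project_prompts(student_prompts, weight, bias)
sequence = prepend_visual_prompts(visual_prompts, patch_tokens)
print(len(sequence), len(sequence[0]))  # 20 12
```

In a trained model the projection weights would be learned jointly with the student prompts, so that gradients from the visual branch also shape the text-side prompts; the sketch only shows the shapes involved.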