Unified Physical-Digital Face Attack Detection
Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei
TL;DR
This work tackles unified face attack detection by introducing UniAttackData, the first ID-consistent dataset that jointly covers physical and digital attacks, with 1,800 subjects and 29,706 videos. It proposes UniAttackDetection, a CLIP-based framework built from three modules (Teacher-Student Prompts, Unified Knowledge Mining, and Sample-Level Prompt Interaction) that learns a compact yet complete feature space spanning live faces and both attack categories. A unified feature-space objective (L_UFM) and multimodal prompt learning support cross-attack and cross-dataset generalization, and the model outperforms single-attack detectors on protocols with seen and unseen attack types. It performs strongly on UniAttackData and three other datasets (FF++, JFSFDB, OULU-NPU); ablations confirm the contribution of each component, and feature visualizations show clearer separation between live and attack clusters. By unifying PAD and DAD in a single vision-language framework, the work removes the need to deploy a separate detector for each attack category.
Abstract
Face Recognition (FR) systems can suffer from physical (e.g., print photo) and digital (e.g., DeepFake) attacks. However, previous work rarely considers both at the same time, which implies deploying multiple models and thus a heavier computational burden. The lack of an integrated model stems from two factors: (1) there is no dataset that includes both physical and digital attacks with ID consistency, meaning that the same ID covers the real face and all attack types; (2) given the large intra-class variance between these two attack categories, it is difficult to learn a compact feature space that detects both simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset consists of 1,800 participants with 2 physical and 12 digital attack types, resulting in a total of 29,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge, respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. Together, these three modules form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection.
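The ID-consistency property described in the abstract (every ID covers the real face and all attack types) can be illustrated with a small check over a dataset manifest. This is a minimal sketch under stated assumptions, not the authors' tooling: the function name `find_inconsistent_ids`, the `(subject_id, sample_type)` record layout, and the labels `"live"`, `"print"`, and `"deepfake"` are all hypothetical and stand in for whatever naming the real dataset uses.

```python
from collections import defaultdict


def find_inconsistent_ids(records, required_types):
    """Map each subject ID to the sample types it is missing.

    records: iterable of (subject_id, sample_type) pairs (hypothetical layout).
    required_types: the types every ID must cover to be ID-consistent.
    Subjects that cover every required type are omitted from the result.
    """
    seen = defaultdict(set)
    for subject_id, sample_type in records:
        seen[subject_id].add(sample_type)

    required = set(required_types)
    return {
        subject_id: required - types
        for subject_id, types in seen.items()
        if not required <= types  # at least one required type is absent
    }


# Hypothetical labels: one live class, one physical and one digital attack.
REQUIRED = {"live", "print", "deepfake"}

records = [
    ("id_001", "live"), ("id_001", "print"), ("id_001", "deepfake"),
    ("id_002", "live"), ("id_002", "deepfake"),  # no physical attack sample
]

missing = find_inconsistent_ids(records, REQUIRED)
```

Here `id_001` is ID-consistent, while `id_002` is reported because it lacks a `"print"` sample. A dataset is ID-consistent in the abstract's sense only when this mapping is empty for the full set of live and attack types.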
