FA^{3}-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection
Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Zhiyao Liang, Yanyan Liang, Jun Wan, Zhen Lei
TL;DR
This work tackles unified face attack detection by introducing FA^{3}-CLIP, which pairs attack-agnostic prompts in the language branch with a dual-stream fusion of spatial and frequency cues in the vision branch. The frequency stream comprises frequency feature generation, multi-layer frequency aggregation, and a frequency compression block, and the whole model is trained under a joint objective that combines a normalized temperature-scaled cross-entropy loss with standard cross-entropy. Key contributions are the attack-agnostic prompt learning framework, the dual-stream fusion that leverages multi-layer frequency information, and new evaluation protocols that enforce strict ID separation and fair evaluation across attack types. The method achieves state-of-the-art results on UniAttackData and competitive results on JFSFDB. The findings indicate that combining frequency cues with textual prompts improves generalization across physical and digital attacks in real-world security scenarios.
Abstract
Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to detect physical and digital attacks simultaneously because of: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA^{3}-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract the corresponding generic representations from live and fake faces, guiding the model to learn a unified feature space for attack detection. Meanwhile, the module adaptively generates a live/fake conditional bias from the original spatial and frequency information and uses it to optimize the generic prompts, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block in the frequency stream reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new, challenging protocols to better evaluate the effectiveness of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
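To make the prompt-based classification and the joint objective mentioned in the TL;DR more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a fused image feature is scored against two prompt embeddings (live and fake) by temperature-scaled cosine similarity, that the contrastive term is a standard NT-Xent loss over paired embeddings, and that the two terms are combined by a simple weighted sum. All function names, the temperatures, and the weight are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, label):
    """Cross-entropy of one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]


def prompt_logits(image_feat, prompt_feats, temperature=0.07):
    """Score an image feature against each prompt embedding.

    The temperature value is an assumption, not taken from the paper.
    """
    return [cosine(image_feat, p) / temperature for p in prompt_feats]


def nt_xent(embeddings, positives, temperature=0.5):
    """Standard NT-Xent: positives[i] is the index of sample i's positive."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [k for k in range(n) if k != i]  # exclude self-similarity
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in others]
        total += cross_entropy(logits, others.index(positives[i]))
    return total / n


def joint_loss(image_feat, prompt_feats, label, embeddings, positives,
               weight=1.0):
    """Assumed combination: cross-entropy plus a weighted NT-Xent term."""
    ce = cross_entropy(prompt_logits(image_feat, prompt_feats), label)
    return ce + weight * nt_xent(embeddings, positives)


# Toy usage: index 0 stands for the "live" prompt, index 1 for "fake".
live_prompt, fake_prompt = [1.0, 0.0], [0.0, 1.0]
fused_feat = [0.9, 0.1]  # stands in for a fused spatial + frequency feature
pairs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss = joint_loss(fused_feat, [live_prompt, fake_prompt], 0,
                  pairs, [1, 0, 3, 2])
```

In this toy setup the loss is small when the fused feature lies close to the prompt of its true class and when each embedding is most similar to its designated positive. How the paper actually forms positive pairs and weights the two terms is not stated in this section.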
