FA^{3}-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection

Yongze Li, Ning Li, Ajian Liu, Hui Ma, Liying Yang, Xihong Chen, Zhiyao Liang, Yanyan Liang, Jun Wan, Zhen Lei

TL;DR

This work tackles unified face attack detection by introducing FA^{3}-CLIP, which pairs attack-agnostic prompts in the language branch with dual-stream cues fusion in the vision branch to combine spatial and frequency cues. The method includes frequency feature generation, multi-layer frequency aggregation, and a frequency compression block, all trained under a joint objective that combines a normalized temperature-scaled cross-entropy loss with standard cross-entropy. Key contributions are the attack-agnostic prompt learning framework, the dual-stream fusion that leverages multi-layer frequency information, and new evaluation protocols designed to ensure strict ID separation and fair evaluation across attack types. The method achieves state-of-the-art results on UniAttackData and competitive results on JFSFDB. The findings indicate that combining frequency cues with textual prompts improves generalization across physical and digital attacks in real-world security scenarios.
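The joint objective mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the SimCLR-style positive pairing in `nt_xent`, the `weight` and `temperature` parameters, and all function names are assumptions made for illustration, since this summary does not specify how the two terms are paired or weighted.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the target class.
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()


def nt_xent(z_a, z_b, temperature=0.5):
    # Assumed pairing: row i of z_a and row i of z_b form a positive pair.
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own positive
    n = len(z_a)
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    return cross_entropy(sim, positives)


def joint_loss(logits, labels, z_a, z_b, weight=1.0, temperature=0.5):
    # Hypothetical weighting: standard CE plus a scaled NT-Xent term.
    return cross_entropy(logits, labels) + weight * nt_xent(z_a, z_b, temperature)


# Usage with random stand-in data: 8 samples, live/fake logits, 16-d embeddings.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2))
labels = rng.integers(0, 2, size=8)
z_a, z_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = joint_loss(logits, labels, z_a, z_b)
```

Lowering `temperature` sharpens the similarity distribution, so well-aligned positive pairs drive the contrastive term toward zero faster.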

Abstract

Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA^{3}-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new, challenging protocols for evaluating the effectiveness of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Frequency differences among live faces, physical attacks (PAs), and digital attacks (DAs) in UniAttackData [fang2024unified]. The frequency density histograms are computed from Fourier Transform maps, averaged over 1,000 samples per category. Notably, higher frequency corresponds to a lower value, and the blue frame highlights the significant frequency differences among the three categories. To further examine these differences, we visualize the frequency difference maps, in which higher frequency components are concentrated at the periphery and lighter colors indicate smaller differences between the compared categories. The results demonstrate substantial variations in frequency information across the three categories, suggesting that frequency features are valuable indicators for unified attack detection.
  • Figure 2: The architecture of the attack-agnostic prompt learning and dual-stream cues fusion framework in FA^{3}-CLIP. The frequency generators $\mathcal{H}_{Ori}^{V}$ and $\mathcal{H}_{Ori}^{L}$ extract frequency information from the original image in the vision and language branches. In the vision branch, the frequency feature at each vision transformer layer is denoted by $\boldsymbol{f}_j=\mathcal{H}_j(\mathcal{V}_j(\boldsymbol{Z}_j))$, where $\mathcal{V}_j(\cdot)$ denotes the $j$-th vision transformer layer and $\boldsymbol{Z}_j$ represents its input tokens. The multi-layer frequency features are then compressed by the frequency compression block (FCB) and integrated with the visual features. In the language branch, the bias generators $\mathcal{B}_i(\cdot)$ are employed to optimize the generic live and fake prompts. Additionally, FA^{3}-CLIP incorporates constraints based on both normalized temperature-scaled cross-entropy $\mathcal{L}_{nt}$ and standard cross-entropy $\mathcal{L}_{ce}$.
  • Figure 3: Feature similarity among different attack types (Physical, Advanced, and Deepfake), derived from similarity scores generated by the pre-trained ViT, forms the basis for designing more reasonable and challenging protocols.
  • Figure 4: Ablation results on UniAttackData [fang2024unified] with Protocol 1. $\downarrow/\uparrow$ indicate that smaller/larger values correspond to better performance. A context length of 6 performs well.
  • Figure 5: UMAP [mcinnes2018umap] visualization of the feature representations learned by baseline CLIP and our model on both UniAttackData [fang2024unified] and JFSFDB [yu2024benchmarking]. Compared to CLIP, our method yields more distinctly separated clusters for live faces and fake faces.
  • ...and 1 more figure
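
The frequency analysis described for Figure 1 (Fourier Transform maps averaged per category, then compared through difference maps) can be sketched roughly as follows. This is an illustrative approximation only: the log-amplitude scaling, the radial binning, the function names, and the random arrays standing in for face crops are all assumptions, not details taken from the paper.

```python
import numpy as np


def log_amplitude_spectrum(gray):
    # 2-D FFT with the zero-frequency term shifted to the center, so that
    # higher frequencies lie toward the periphery of the map.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum))


def mean_spectrum(images):
    # Average the spectra of all grayscale images in one category.
    return np.mean([log_amplitude_spectrum(img) for img in images], axis=0)


def radial_profile(spectrum, n_bins=16):
    # Mean spectrum value per distance-from-center bin (low to high frequency).
    h, w = spectrum.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    edges = np.linspace(0, radius.max(), n_bins + 1)
    idx = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=spectrum.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return sums / np.maximum(counts, 1)  # empty bins yield 0


# Hypothetical stand-ins for two categories of 64x64 grayscale face crops.
rng = np.random.default_rng(0)
live = rng.random((10, 64, 64))
fake = rng.random((10, 64, 64))

# Difference map between the category-level mean spectra.
difference_map = np.abs(mean_spectrum(live) - mean_spectrum(fake))
live_profile = radial_profile(mean_spectrum(live))
```

With real data, each category array would hold the sampled face images (the caption reports 1,000 per category), and the difference map would be plotted as a heatmap.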