Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering

Jiabao Guo, Ajian Liu, Yunfeng Diao, Jin Zhang, Hui Ma, Bo Zhao, Richang Hong, Meng Wang

TL;DR

Domain Generalization in Face Anti-Spoofing is tackled by CCPE, which replaces semantics-poor class prompts with instance-aware prompts derived from an instruction-based LLM and a learnable Q-Former branch, coupled with a Cross-modal Guidance Module to fuse language and vision. This approach yields state-of-the-art generalization on cross-domain FAS benchmarks and is supported by ablations showing the value of each component. The method demonstrates that content-aware prompts and multimodal guidance can mitigate domain shifts without target-domain data. The work offers a practical, scalable pathway for robust FAS in real-world, diverse capture settings by leveraging rich semantic information from LLMs and adaptable visual prompts.

Abstract

The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals with subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) Facial category labels, such as real or spoof, carry no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed-template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) The inherent content prompt explicitly benefits from abundant knowledge transferred from the instruction-based Large Language Model (LLM). (2) The learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better-generalized FAS. Finally, the effectiveness of our CCPE has been validated in multiple cross-domain experiments, where it achieves state-of-the-art (SOTA) results.
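To make the class-wise baseline concrete, the following is a minimal sketch of how CLIP-style models can use one text embedding per class as the weights of a visual classifier. It is not the CCPE implementation: the toy vectors stand in for encoder outputs, and the function names and the `scale` value are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_probabilities(image_feat, class_text_feats, scale=100.0):
    """Score one image against one text embedding per class.

    Each class text embedding plays the role of a classifier weight vector;
    `scale` is an assumed temperature-like multiplier on the cosine similarity.
    """
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in class_text_feats:
        txt = l2_normalize(text_feat)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: in a class-wise scheme these two text vectors are
# shared by every image, which is the limitation the abstract points out.
real_text = [1.0, 0.0, 0.0]
spoof_text = [0.0, 1.0, 0.0]
image = [0.9, 0.1, 0.3]

probs = class_probabilities(image, [real_text, spoof_text])
```

In an instance-wise scheme such as the one the abstract describes, the text-side features would instead depend on the content of each input image rather than being fixed per class.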

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison with existing prompt engineering paradigms for DG FAS. (a) Methods based on template prompts, like FLIP [srivatsan2023flip], require professional task-specific knowledge to manually design category descriptions. (b) Methods based on prompt learning, like CoOp [zhou2022coop], cannot generate accurate category descriptions because they lack an understanding of the category semantics. (c) The proposed CCPE addresses these limitations by constructing both explicit and implicit composite content prompts, which benefit from abundant knowledge transferred from an instruction-based LLM and from the most informative visual content extracted via Q-Former.
  • Figure 2: Overall architecture of our proposed Content-aware Composite Prompt Engineering (CCPE) framework for DG FAS. Our CCPE is built on CLIP and adapts to FAS tasks through prompt engineering, with two main contributions: (1) Content-aware Composite Prompt Engineering. CCPE generates instance-wise composite prompts, including both inherent content prompts and learnable content prompts. The inherent content prompt explicitly benefits from abundant knowledge transferred from the instruction-based LLM. The learnable content prompts implicitly extract the most informative visual content via Q-Former. (2) Cross-modal Guidance Module (CGM). CGM encompasses composite language fusion and vision-language modality fusion. This multimodal fusion approach is usually better at capturing different aspects of the underlying concepts and dynamically adjusts the use of unimodal features for better-generalized FAS.
  • Figure 3: The proposed Cross-modal Guidance Module (CGM). CGM includes two components: composite language fusion and vision-language modality fusion.
  • Figure 4: UMAP [umap2018] visualization of the features learned at the penultimate layer of the proposed CCPE method in the cross-dataset FAS task of Protocol 1. The dotted line in each visualization represents the decision boundary derived from the training samples in the 2D space. The consistent separation of live and spoof samples across different domain combinations highlights the ability of CCPE to learn domain-invariant features for FAS.
  • Figure 5: Visualization of attention maps on images from different scenarios in Protocol 1. Classification errors arose in the baseline model due to its shortcomings in detecting spoofing regions. Our CCPE accurately adjusts the region of interest by excluding domain-related interfering information and excels in correctly classifying these samples.
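
The Q-Former branch mentioned in Figures 1 and 2 builds on the general idea of a small set of learnable query vectors attending over image features. The following is a minimal sketch of that idea only, as single-head scaled dot-product cross-attention with no learned projections. It is not the paper's Q-Former: the function name `cross_attend`, the toy vectors, and the omission of projections, self-attention, and stacking are all simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Let each query vector attend over all image patch features.

    queries: K vectors (learnable parameters in a real model).
    patches: N image feature vectors of the same dimension.
    Returns K vectors, each a weighted average of the patches.
    """
    dim = len(patches[0])
    outputs = []
    for query in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [
            sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim)
            for patch in patches
        ]
        weights = softmax(scores)
        # Weighted average of patch features under the attention weights.
        outputs.append(
            [sum(w * patch[d] for w, patch in zip(weights, patches)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy inputs: two queries, two one-hot "patch" features.
queries = [[10.0, 0.0], [0.0, 10.0]]
patches = [[1.0, 0.0], [0.0, 1.0]]

prompt_tokens = cross_attend(queries, patches)
```

The number of output vectors equals the number of queries regardless of how many patches the image has, which is what makes such outputs convenient to use as a fixed-length, instance-dependent prompt.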