CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing

Ajian Liu, Shuai Xue, Jianwen Gan, Jun Wan, Yanyan Liang, Jiankang Deng, Sergio Escalera, Zhen Lei

TL;DR

The paper tackles domain generalization in face anti-spoofing by moving away from domain labels and disentangled representations, replacing them with class-free prompt learning powered by Vision-Language Models. It introduces CFPL, a framework built on CLIP that uses a Content Q-Former (CQF) and a Style Q-Former (SQF) to generate content- and style-conditioned prompts, which modulate visual features through a text-driven Prompt Modulation (PM) mechanism. Key innovations include Prompt-Text Matched (PTM) supervision, Diversified Style Prompt (DSP), and a gating-based feature modulation. Empirically, CFPL achieves state-of-the-art results on multiple cross-domain benchmarks and demonstrates the practicality of text-conditioned prompts for generalizable FAS.

Abstract

Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. Existing methods either rely on domain labels to align domain-invariant feature spaces, or disentangle generalizable features from the whole sample; both inevitably distort semantic feature structures and achieve limited generalization. In this work, we make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features. Specifically, we propose a novel Class Free Prompt Learning (CFPL) paradigm for DG FAS, which utilizes two lightweight transformers, namely Content Q-Former (CQF) and Style Q-Former (SQF), to learn different semantic prompts conditioned on content and style features, respectively, by using a set of learnable query vectors. The generalizable prompt is then learned through two improvements: (1) A Prompt-Text Matched (PTM) supervision is introduced to ensure CQF learns the visual representation that is most informative of the content description. (2) A Diversified Style Prompt (DSP) technique is proposed to diversify the learning of style prompts by mixing feature statistics between instance-specific styles. Finally, the learned text features modulate the visual features toward generalization through the designed Prompt Modulation (PM). Extensive experiments show that CFPL is effective and outperforms state-of-the-art methods on several cross-domain datasets.
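
The statistic mixing behind DSP can be illustrated with a short sketch. The abstract only states that feature statistics are mixed between instance-specific styles. The version below assumes a MixStyle-like recipe (per-channel mean and standard deviation over tokens, interpolated with a Beta-sampled weight), which may differ from the paper's exact formulation. All names (`channel_stats`, `mix_style`, the Beta parameter 0.1) are hypothetical and not taken from the authors' code.

```python
import math
import random


def channel_stats(tokens, eps=1e-6):
    """Per-channel mean and standard deviation over the token axis.

    `tokens` is a list of equal-length rows (tokens x channels).
    """
    n = len(tokens)
    channels = list(zip(*tokens))  # transpose to channels x tokens
    mean = [sum(c) / n for c in channels]
    std = [math.sqrt(sum((x - m) ** 2 for x in c) / n + eps)
           for c, m in zip(channels, mean)]
    return mean, std


def mix_style(tokens_a, tokens_b, lam, eps=1e-6):
    """Re-style instance A with statistics interpolated between A and B.

    lam = 1 keeps A's own style; lam = 0 transfers B's style onto A.
    """
    mu_a, sd_a = channel_stats(tokens_a, eps)
    mu_b, sd_b = channel_stats(tokens_b, eps)
    mu_mix = [lam * a + (1 - lam) * b for a, b in zip(mu_a, mu_b)]
    sd_mix = [lam * a + (1 - lam) * b for a, b in zip(sd_a, sd_b)]
    # Normalize A per channel, then apply the mixed statistics.
    return [
        [(x - ma) / sa * sm + mm
         for x, ma, sa, mm, sm in zip(row, mu_a, sd_a, mu_mix, sd_mix)]
        for row in tokens_a
    ]


if __name__ == "__main__":
    random.seed(0)
    # Toy "layer features" for two instances: 8 tokens x 4 channels each.
    feat_a = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    feat_b = [[random.gauss(3.0, 2.0) for _ in range(4)] for _ in range(8)]
    lam = random.betavariate(0.1, 0.1)  # assumed Beta(alpha, alpha) prior
    mixed = mix_style(feat_a, feat_b, lam)
    print(len(mixed), len(mixed[0]))
```

Only the style statistics change here: the normalized content of instance A is preserved, so the style branch sees more varied styles without new images being collected.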

Paper Structure (14 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison with existing DG FAS methods. (a) Previous methods either rely on a projector to align domain-invariant feature spaces through adversarial training, or disentangle generalizable features from the whole sample with a decoupler; both inevitably distort semantic structures and achieve limited generalization. (b) Our CFPL framework is built on CLIP and learns generalized visual features by using the text features as the weights of the classifier.
  • Figure 2: Our CFPL is built on CLIP (Radford et al., 2021), which consists of an image encoder $\mathcal{V}$ and a text encoder $\mathcal{T}$, and adapts it to FAS tasks via prompt learning with four contributions: (1) CQF and SQF. CFPL introduces two lightweight transformers, Content Q-Former (CQF) and Style Q-Former (SQF), which use a set of learnable query vectors to learn different semantic prompts conditioned on content and style features from the image encoder, respectively; (2) Prompt-Text Matched (PTM) supervision. The fixed template description of each sample serves as supervision to ensure CQF learns a semantic visual representation; (3) Diversified Style Prompt (DSP). The style from each layer of the image encoder is diversified by mixing feature statistics; (4) Prompt Modulation (PM). The generalized visual feature is adjusted by a modulation factor, which is generated from the text feature through the designed modulation function.
  • Figure 3: Results of each method on three metrics across all sub-protocols, where the red line represents the Baseline and the blue line represents our CFPL. For the HTER metric, the smaller the area enclosed by the lines, the better the corresponding method performs. The opposite holds for the AUC and TPR@FPR=1% metrics.
  • Figure 4: Attention maps on all sub-protocols of Protocol 1, produced with the visualization tool of Chefer et al. (ICCV 2021). The Baseline misclassifies these samples because it fails to detect the spoofing regions, whereas our CFPL classifies them correctly by correcting the region of interest.
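
For readers unfamiliar with the metrics named in Figure 3, the following minimal sketch shows how HTER and TPR at a fixed FPR are commonly computed. It assumes that higher scores mean "more likely live", that live faces are the positive class (label 1), and that the decision threshold is supplied by the caller. These conventions vary between FAS papers and are not confirmed by this section, and the function names are hypothetical.

```python
def hter(scores, labels, threshold):
    """Half Total Error Rate: the mean of FAR and FRR at a given threshold.

    labels: 1 = live (bona fide), 0 = spoof. A sample is accepted as live
    when its score is >= threshold.
    """
    live = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    frr = sum(s < threshold for s in live) / len(live)      # live rejected
    far = sum(s >= threshold for s in spoof) / len(spoof)   # spoof accepted
    return (far + frr) / 2


def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """True positive rate at the strictest threshold with FPR <= target_fpr."""
    live = [s for s, y in zip(scores, labels) if y == 1]
    spoof = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    allowed = int(target_fpr * len(spoof))  # spoof samples we may accept
    if allowed >= len(spoof):
        return 1.0  # every sample can be accepted
    threshold = spoof[allowed]
    # Accept only scores strictly above the threshold so FPR stays in budget.
    return sum(s > threshold for s in live) / len(live)


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    print(hter(toy_scores, toy_labels, threshold=0.5))
    print(tpr_at_fpr(toy_scores, toy_labels, target_fpr=0.01))
```

Lower HTER is better, while higher TPR at a fixed FPR is better, which matches the opposite reading directions described in the Figure 3 caption.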