SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining

Pei-Kai Huang, Jun-Xiong Chong, Cheng-Hsuan Chiang, Tzu-Hsien Chen, Tyng-Luh Liu, Chiou-Ting Hsu

TL;DR

Face anti-spoofing in the one-class setting remains challenging because no spoof examples are available for training and because domains vary. The proposed SLIP framework uses CLIP-based image and text encoders with language-guided spoof cue maps, prompt-driven liveness feature disentanglement, and spoof-like feature augmentation to learn disentangled live features $z$ while generating diverse spoof representations. The method introduces the losses $L_L$, $L_S$, $L_{FD}$, $L_{FA}$, $L_R$, and $L_A$ to enforce zero cues for live data, align live/spoof features, disentangle content from liveness, and diversify spoof features. It achieves strong intra- and cross-domain performance across seven datasets. Empirical results show that SLIP surpasses prior one-class FAS methods and is competitive with two-class approaches, indicating improved generalization to unseen spoof types and physical adversarial attacks.

Abstract

Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged VLP guidance, but this potential remains unexplored in one-class FAS methods. One-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested in a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelevant domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy that fuses latent features from live images and spoof prompts to generate spoof-like image features, thereby diversifying latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.
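
The spoof cue map idea in the abstract (all-zero maps for live faces, nonzero maps where an attack-related object is simulated to cover the face) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the 8x8 grid, the four-position vocabulary, and the mean-absolute-value summary are hypothetical stand-ins, and the actual SLIP decoder and losses may differ.

```python
# Hypothetical sketch: names, grid size, and position vocabulary are
# illustrative assumptions, not the paper's implementation.

# Assumed mapping from a position word in a spoof prompt to a grid corner.
POSITIONS = {
    "top-left": (0, 0),
    "top-right": (0, 1),
    "bottom-left": (1, 0),
    "bottom-right": (1, 1),
}


def live_cue_map(size):
    """Live faces are assumed to yield an all-zero spoof cue map."""
    return [[0.0] * size for _ in range(size)]


def simulated_spoof_cue_map(size, patch, position):
    """Binary map with a fixed-size patch of ones at the named corner."""
    row_slot, col_slot = POSITIONS[position]
    top = row_slot * (size - patch)
    left = col_slot * (size - patch)
    cue = live_cue_map(size)
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            cue[r][c] = 1.0
    return cue


def mean_abs(cue):
    """Mean absolute value: zero for a live target, positive for a simulated spoof."""
    cells = [abs(v) for row in cue for v in row]
    return sum(cells) / len(cells)


live = live_cue_map(8)
spoof = simulated_spoof_cue_map(8, patch=4, position="bottom-right")
print(mean_abs(live))   # 0.0
print(mean_abs(spoof))  # 0.25
```

In this toy setup the patch size stays fixed while only the position changes, so every simulated spoof target has the same covered area and differs from the live target only in where the nonzero region sits.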

Paper Structure

This paper contains 30 sections, 11 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Liveness feature disentanglement: (a) Existing language-guided two-class FAS methods overlook the presence of domain information (e.g., face content) in prompt learning and therefore extract entangled live/spoof features. (b) One-class FAS methods learn entangled live features from live training images. (c) By exploring the mutual relatedness in the given text prompts, the proposed SLIP approach to one-class FAS learns pure live features by separating domain features from live/spoof features.
  • Figure 2: The proposed SLIP consists of one image encoder $E_I$, one text encoder $E_T$, one spoof cue map decoder $D$, and one fusion module $R$.
  • Figure 3: Examples of fixed-size $\widetilde{\mathbf{m}}$ produced at the different positions specified in the spoof prompts.
  • Figure 4: Illustration of liveness feature disentanglement.
  • Figure 5: Illustration of (a) spoof prompt feature reconstruction and (b) spoof-like image feature augmentation.
  • ...and 2 more figures