TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing

Lianrui Mu, Jianhong Bai, Xiaoxuan He, Jiangnan Ye, Xiaoyu Liang, Yuchen Yang, Jiedong Zhuang, Haoji Hu

TL;DR

TeG-DG addresses cross-domain generalization in Face Anti-Spoofing by injecting textual supervision into training. It introduces a Hierarchical Attention Fusion module to fuse multi-level visual cues and a Textual-Enhanced Visual Discriminator that aligns visual features with text prototypes via a vision-language triplet loss, while keeping inference purely visual. Across four datasets under Leave-One-Out and in few-shot regimes, TeG-DG achieves state-of-the-art generalization, including substantial gains when data are extremely limited. This work demonstrates the practical value of text-driven cross-domain alignment for robust FAS in real-world settings.
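
To make the idea of a vision-language triplet loss concrete, here is a minimal sketch with a visual feature as the anchor and two text prototypes as the positive and negative. The function names, the cosine distance, the margin value, and the toy 3-d embeddings are all assumptions for illustration; the summary above does not give the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vl_triplet_loss(visual, text_pos, text_neg, margin=0.2):
    """Hinge-style triplet loss: visual anchor, text positive and negative.

    The loss reaches zero once the anchor is closer (in cosine distance) to
    the matching text prototype than to the mismatched one by `margin`.
    The margin of 0.2 is an assumed value, not taken from the paper.
    """
    d_pos = 1.0 - cosine(visual, text_pos)
    d_neg = 1.0 - cosine(visual, text_neg)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings, for illustration only.
live_text = [1.0, 0.0, 0.0]          # text prototype for "live face"
spoof_text = [0.0, 1.0, 0.0]         # text prototype for "spoof attack"
aligned_visual = [0.9, 0.1, 0.0]     # live image already near the live text
misaligned_visual = [0.1, 0.9, 0.0]  # live image that drifted toward spoof

loss_aligned = vl_triplet_loss(aligned_visual, live_text, spoof_text)
loss_misaligned = vl_triplet_loss(misaligned_visual, live_text, spoof_text)
```

In this sketch the text prototypes appear only in the training loss, which is consistent with the statement that inference remains purely visual.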

Abstract

Improving the domain generalization of Face Anti-Spoofing (FAS) techniques has become a research focus. Existing methods concentrate on extracting domain-invariant features from multiple training domains. Despite their promising performance, the extracted features inevitably retain residual style bias (e.g., illumination, capture device), which degrades generalization. In this paper, we propose an alternative solution, the Textually Guided Domain Generalization (TeG-DG) framework, which leverages text information for cross-domain alignment. Our core insight is that text, as a more abstract and universal form of expression, can capture the commonalities and essential characteristics across various attacks, bridging the gap between different image domains. Unlike existing vision-language models, the proposed framework is designed specifically to enhance domain generalization for the FAS task. Concretely, we first design a Hierarchical Attention Fusion (HAF) module to adaptively aggregate visual features at different levels; then, we propose a Textual-Enhanced Visual Discriminator (TEVD) that both improves alignment between the two modalities and regularizes the classifier with unbiased text features. TeG-DG significantly outperforms previous approaches, especially when source domain data are extremely limited (~14% and ~12% improvements on HTER and AUC, respectively), showcasing strong few-shot performance.
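
For readers unfamiliar with the HTER metric quoted above, the following sketch computes the Half Total Error Rate as the mean of the false acceptance rate (spoofs accepted as live) and the false rejection rate (live faces rejected). The fixed threshold, the score convention (higher means more likely live), and the toy scores are assumptions for illustration; how the paper selects its operating threshold is not stated here.

```python
def hter(scores, labels, threshold):
    """Half Total Error Rate at a fixed decision threshold.

    scores: liveness scores, where higher means "more likely live".
    labels: 1 for live (bona fide), 0 for spoof (attack).
    """
    live = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    # False rejection: a live sample scored below the threshold.
    frr = sum(s < threshold for s in live) / len(live)
    # False acceptance: a spoof sample scored at or above the threshold.
    far = sum(s >= threshold for s in spoof) / len(spoof)
    return (far + frr) / 2.0

# Toy scores, for illustration only: one live face is rejected (0.4)
# and one spoof is accepted (0.7) at threshold 0.5.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
example_hter = hter(scores, labels, threshold=0.5)
```

Lower HTER is better, whereas higher AUC is better, so an "improvement" on HTER means a reduction.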

Paper Structure (23 sections, 8 equations, 17 figures, 6 tables, 4 algorithms)

Figures (17)

  • Figure 1: Comparison with previous Face Anti-Spoofing (FAS) methods. Our approach leverages text information for better generalization without using domain labels.
  • Figure 2: Overview of the proposed Textually Guided Domain Generalization (TeG-DG) framework. The framework contains a Text Prompter (TP) for text prompt generation, the Hierarchical Attention Fusion (HAF) module for fused visual feature extraction, and the Textual-Enhanced Visual Discriminator (TEVD) for integrating text information.
  • Figure 3: Text prompt generation for a training image.
  • Figure 4: Illustration of the designed Hierarchical Attention Fusion (HAF) module. The proposed HAF is a lightweight plug-and-play module that can be easily integrated into mainstream ViT models.
  • Figure 5: The proposed textual-enhanced visual discriminator (TEVD). TEVD consists of a vision-language triplet loss and a multi-modal classifier.
  • ...and 12 more figures
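
As a rough illustration of what "adaptive aggregation of visual features at different levels" could look like, the sketch below weights per-level feature vectors with softmax attention scores and sums them. This is not the paper's HAF design, which is only shown in Figure 4; the function names, the scaled dot-product scoring, the query vector, and the toy features are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_levels(level_features, query):
    """Fuse one feature vector per level into a single vector.

    Each level is scored by a scaled dot product with `query`; the scores
    are normalized with softmax and used as mixing weights.
    """
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
        for feat in level_features
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, level_features))
        for i in range(d)
    ]
    return fused, weights

# Hypothetical features from three depths of a backbone (toy 2-d vectors).
levels = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]  # in a real module this would be learned
fused, weights = fuse_levels(levels, query)
```

Because the weights depend on the features themselves, the mix across levels can change per input, which is one plausible reading of "adaptive" here.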