Enhancing Learnable Descriptive Convolutional Vision Transformer for Face Anti-Spoofing

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Chiou-Ting Hsu

TL;DR

The paper tackles face anti-spoofing (FAS) by addressing fine-grained supervision gaps, partial spoofing, and cross-domain generalization. It extends the Learnable Descriptive Convolutional Vision Transformer (LDCformer) with three training strategies: dual-attention supervision for region-specific liveness cues, self-challenging supervision to create hard mixed attacks, and transitional triplet mining to improve domain generalization. Across extensive intra- and cross-domain benchmarks, the approach achieves state-of-the-art results and demonstrates robust performance on unseen and partial attacks, validating the effectiveness of joint supervision. The work also discusses computational trade-offs and proposes future directions toward multi-modal FAS and online/test-time adaptation to sustain gains in practical applications.

Abstract

Face anti-spoofing (FAS) relies heavily on identifying live/spoof discriminative features to counter face presentation attacks. Recently, we proposed LDCformer, which incorporates Learnable Descriptive Convolution (LDC) into ViT to model long-range dependencies among locally descriptive features for FAS. In this paper, we propose three novel training strategies that enhance the training of LDCformer and substantially boost its feature characterization capability. The first strategy, dual-attention supervision, learns fine-grained liveness features guided by regional live/spoof attentions. The second strategy, self-challenging supervision, enhances the discriminability of the features by generating challenging training data. The third strategy, transitional triplet mining, narrows the cross-domain gap while maintaining the transitional relationship between live and spoof features, thereby improving the domain-generalization capability of LDCformer. Extensive experiments show that LDCformer, under joint supervision of the three training strategies, outperforms previous methods.

Paper Structure

This paper contains 37 sections, 11 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Examples of full-face and partial spoof attacks from the PADISI-Face dataset [rostami2021detection].
  • Figure 2: Overview of LDCformer [huang2023ldcformer], where we additionally include $N_1$ LDCformer encoders $\textbf{E}_{LDCformer}$ (purple solid box) between the linear projection $\textbf{LP}$ and the standard Transformer encoders $\textbf{E}_{ViT}$ (blue solid box) in ViT.
  • Figure 3: The LDCformer encoder $\textbf{E}_{LDCformer}$ in [huang2023ldcformer] (i.e., the purple solid box in Figure 2) consists of an LDC encoder $\textbf{E}_{LDC}$ (red solid box), an LDC-based Multi-Head Self-Attention (LDC-MSA), and a multilayer perceptron block (MLP).
  • Figure 4: Illustration of the three proposed training strategies. First, to address the lack of fine-grained labels, we adopt the auxiliary network $\textbf{E}_{Res18}-\textbf{CF}_{Res18}$ to generate the activation maps $\mathbf{A}_l$ and $\mathbf{A}_s$, which provide fine-grained supervision for LDCformer (Dual-Attention Supervision $\mathcal{L}_{dual}$). Next, we mix live and spoof images to generate challenging images $x^\prime$ and their corresponding attention-based ground truth ${\mathbf{\hat{A}}}_l$ and ${\mathbf{\hat{A}}}_s$ to supervise LDCformer in detecting subtle partial spoofing attacks (Self-Challenging Supervision $\mathcal{L}_{sc}$). Finally, we examine the relationship between different spoof attacks and live images within the features $\mathbf{z}^{N_1+N_2}$ to enhance LDCformer for tackling cross-domain testing issues (Transitional Triplet Mining $\mathcal{L}_{tran-trip}$).
  • Figure 5: Generation of quasi-ground truth for the two attention estimators LE and SE from (a) a live image and (b) a spoof image.
  • ...and 4 more figures
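
The Figure 4 caption describes mixing live and spoof images to produce a challenging image $x^\prime$ together with region-level targets. The sketch below illustrates one simple way this could look, and is not the authors' procedure: it assumes a rectangular spoof patch pasted onto a live image, and it derives the live/spoof target maps directly from the paste mask. The function name `mix_live_spoof`, its parameters, and the binary-mask targets are all hypothetical; the paper's actual mixing rule and its definition of ${\mathbf{\hat{A}}}_l$ and ${\mathbf{\hat{A}}}_s$ may differ.

```python
import numpy as np


def mix_live_spoof(live, spoof, top, left, height, width):
    """Paste a rectangular spoof patch onto a live image (illustrative only).

    Returns the mixed image and two per-pixel maps: one marking live
    pixels and one marking spoof pixels. The rectangular patch and the
    binary maps are assumptions, not the paper's definition.
    """
    if live.shape != spoof.shape:
        raise ValueError("live and spoof images must have the same shape")
    h, w = live.shape[:2]
    inside = top >= 0 and left >= 0 and top + height <= h and left + width <= w
    if not inside:
        raise ValueError("patch must lie inside the image")

    # Binary mask: 1 where the pixel is taken from the spoof image.
    spoof_map = np.zeros((h, w), dtype=np.float32)
    spoof_map[top:top + height, left:left + width] = 1.0

    # Add a channel axis so the mask broadcasts over colour images.
    m = spoof_map[..., None] if live.ndim == 3 else spoof_map
    mixed = (1.0 - m) * live + m * spoof

    live_map = 1.0 - spoof_map
    return mixed.astype(live.dtype), live_map, spoof_map


# Toy inputs: an all-zero "live" image and an all-one "spoof" image.
live = np.zeros((8, 8, 3), dtype=np.float32)
spoof = np.ones((8, 8, 3), dtype=np.float32)
mixed, a_live, a_spoof = mix_live_spoof(live, spoof, top=2, left=2, height=4, width=4)
```

In this toy setup the two maps are complementary, so every pixel is labelled either live or spoof. A training pipeline would presumably sample the patch location and size at random and resize the maps to the resolution of the model's attention outputs; those details are left out here because this section does not specify them.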