S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens

Rizhao Cai, Zitong Yu, Chenqi Kong, Haoliang Li, Changsheng Chen, Yongjian Hu, Alex Kot

TL;DR

The paper tackles cross-domain generalization in Face Anti-Spoofing by integrating a novel Statistical Adapter (S-Adapter) into pre-trained Vision Transformers under Efficient Parameter Transfer Learning. It introduces token maps and differentiable token histograms to capture local discriminative and statistical information, then regularizes domain style variance with Token Style Regularization (TSR) based on Gram matrices. The approach yields state-of-the-art zero-shot and few-shot cross-domain results and robust unseen-attack detection with minimal overhead (~0.5% MACs and parameters). This work enhances ViT-based FAS robustness under diverse imaging conditions and attack types, with potential extensions to related biometric and forensics tasks. The combination of histogram-based token statistics and TSR provides a practical, generalizable path for adapting large pre-trained models to domain-diverse security applications.

Abstract

Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models, but their cross-domain generalization is often hindered by the domain shift problem, which arises from differing distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, in which we adapt pre-trained Vision Transformer (ViT) models to the FAS task. During training, adapter modules are inserted into the pre-trained ViT model, and only the adapters are updated while the other pre-trained parameters remain fixed. We find that previous vanilla adapters are limited because they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that the proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests. We will release the source code upon acceptance.
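
The Gram-matrix idea behind TSR can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that Gram matrices extracted from tokens across domains are regularized. The averaging over tokens, the mean-squared-difference penalty, and the names `gram_matrix` and `style_gap` are all assumptions made here for illustration.

```python
from typing import List

Matrix = List[List[float]]


def gram_matrix(tokens: Matrix) -> Matrix:
    """Return the C x C Gram matrix of an N x C token matrix, averaged over the N tokens."""
    n, c = len(tokens), len(tokens[0])
    return [
        [sum(tok[i] * tok[j] for tok in tokens) / n for j in range(c)]
        for i in range(c)
    ]


def style_gap(tokens_a: Matrix, tokens_b: Matrix) -> float:
    """Mean squared difference between the Gram matrices of two token sets (assumed penalty)."""
    g_a, g_b = gram_matrix(tokens_a), gram_matrix(tokens_b)
    c = len(g_a)
    return sum((g_a[i][j] - g_b[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


# Toy tokens (2 tokens, 2 channels each) standing in for samples from two domains.
domain_a = [[1.0, 0.0], [0.0, 1.0]]
domain_b = [[2.0, 0.0], [0.0, 2.0]]

print(gram_matrix(domain_a))
print(style_gap(domain_a, domain_a))
print(style_gap(domain_a, domain_b))
```

In this toy case the two domains differ only in token magnitude, yet their Gram matrices differ, so the gap is non-zero. Minimizing such a gap during training would push token statistics from different domains toward each other, which is the intuition the abstract describes.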

Paper Structure

This paper contains 31 sections, 11 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: (a) In the traditional transfer learning paradigm for training a ViT model on the Face Anti-Spoofing task, a pre-trained Vision Transformer (ViT) model is used for initialization, so that knowledge from the pre-training dataset can be reused. Usually, all or a large proportion of the model parameters are fine-tuned. (b) In the Efficient Parameter Transfer Learning paradigm, adapter modules are integrated into a pre-trained ViT model. Throughout training, only the adapter parameters are updated and the pre-trained parameters stay fixed. Previous vanilla adapters, which are based on linear layers, lack a task-aware inductive bias [chen2023visionadapter], thereby limiting the utilization of pre-trained models. (c) Our proposed S-Adapter addresses this limitation by using localized token histograms to extract statistical information, enabling more efficient fine-tuning of the pre-trained ViT model for cross-domain generalized face anti-spoofing.
  • Figure 2: (a) The traditional texture-analysis pipeline for face anti-spoofing: handcrafted features (LBP) are extracted first, and these are often sensitive to illumination changes. Histogram features are then extracted as the final representation for the classifier, which is more robust to lighting changes. (b) Inspired by (a), our adapter extracts local information from spatial tokens and computes token histograms to improve cross-domain performance.
  • Figure 3: The structure of the ViT backbone and our S-Adapter. The S-Adapter is inserted into the ViT and updated during training.
  • Figure 4: The overall optimization. The bona fide and attack examples are used to calculate the binary cross-entropy loss ($\mathcal{L}_{BCE}$), but only the bona fide examples are used to calculate the Token Style Regularization ($\mathcal{L}_{TSR}$).
  • Figure 5: Ablation study on adapters. Red bars show the results of our S-Adapters. Green bars show the results of removing the token histogram from our S-Adapters. Blue bars show the results of further removing the gradient information ($\theta=0$).
  • ...and 6 more figures
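
The token histogram mentioned in the Figure 2 caption, and described as differentiable in the TL;DR, can be approximated by soft binning. The sketch below is only an illustration under assumptions: the triangular kernel, the fixed bin centers, the scalar inputs, and the function name `soft_histogram` are hypothetical and are not taken from the paper, which computes histograms over localized token maps.

```python
from typing import List


def soft_histogram(values: List[float], centers: List[float], width: float) -> List[float]:
    """Soft-bin `values` with triangular kernels centred on `centers`.

    Each value contributes max(0, 1 - |v - c| / width) to the bin at c, so the
    bin counts vary continuously with the input instead of jumping.
    """
    counts = [
        sum(max(0.0, 1.0 - abs(v - c) / width) for v in values)
        for c in centers
    ]
    total = sum(counts)
    # Normalise so the histogram sums to 1 (skipped if no value fell in range).
    return [x / total for x in counts] if total > 0 else counts


centers = [0.0, 0.5, 1.0]
hist = soft_histogram([0.0, 0.25, 1.0], centers, width=0.5)
print(hist)
```

A hard histogram would assign the value 0.25 entirely to one bin. Here it is split evenly between the bins at 0.0 and 0.5, so a small change in the input produces a small change in the histogram, which is what allows gradients to flow through such a layer.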