Advancing Cross-Domain Generalizability in Face Anti-Spoofing: Insights, Design, and Metrics

Hyojin Kim, Jiyoon Lee, Yonghyun Jeong, Haneol Jang, YoungJoon Yoo

TL;DR

The paper tackles cross-domain generalization in face anti-spoofing (FAS) for video inputs, noting that frame-wise predictions can be unstable in real-world scenarios. It introduces video-wise aggregation and two metrics that quantify temporal robustness: a bias term $B = \frac{1}{N} \sum_{i=1}^N (Y_i - \hat{Y}_i)^2$ and a variance term $V = \frac{1}{N} \sum_{i=1}^N \sigma^2(P_i)$, with $\sigma^2(P_i) = \frac{1}{M_i} \sum_{j=1}^{M_i} (P_{ij}-\bar{P}_i)^2$. Here $N$ is the number of videos, $M_i$ is the number of frames in video $i$, $P_{ij}$ is the predicted probability for frame $j$ of video $i$, $\bar{P}_i$ is the mean of those probabilities, and $Y_i$ and $\hat{Y}_i$ are the label and the prediction for video $i$. The authors propose ECLIPS, an ensemble framework comprising a CLIP Visual Encoder-based base learner and a learnable decision fusion module, trained with Monte Carlo dropout to capture uncertainty and improve generalization across datasets such as OCIM, CelebA-Spoof, and SiW-Mv2. Key contributions are the bias-variance robustness metrics for FAS, a demonstration that scaling the backbone alone is insufficient for generalization, and state-of-the-art HTER and AUC on multiple cross-domain benchmarks through ensemble design. The work advances practical video FAS by enabling uncertainty-aware training and robust, scalable deployment with smaller backbones.
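To make the two metrics concrete, the following is a minimal sketch of how they could be computed from frame-level probabilities. It assumes that the video-level prediction $\hat{Y}_i$ is the mean frame probability $\bar{P}_i$; the summary does not state the aggregation rule, so this choice and the function name `video_bias_variance` are illustrative assumptions rather than the authors' implementation.

```python
from statistics import fmean, pvariance


def video_bias_variance(frame_probs, labels):
    """Compute the bias and variance metrics over a set of videos.

    frame_probs: list of lists; frame_probs[i][j] is P_ij, the predicted
                 probability for frame j of video i.
    labels:      list of ground-truth video labels Y_i (0 or 1).

    Assumption: the video-level prediction is the mean frame probability.
    """
    if not labels or len(frame_probs) != len(labels):
        raise ValueError("need one non-empty list of frame probabilities per label")

    # Video-wise aggregation: average the frame-level probabilities.
    video_preds = [fmean(probs) for probs in frame_probs]

    # Bias: mean squared gap between video label and video prediction.
    bias = fmean([(y - p) ** 2 for y, p in zip(labels, video_preds)])

    # Variance: population variance of frame probabilities within each
    # video (divides by M_i), averaged over the N videos.
    variance = fmean([pvariance(probs) for probs in frame_probs])
    return bias, variance


if __name__ == "__main__":
    probs = [[0.9, 0.7], [0.2, 0.4]]
    print(video_bias_variance(probs, [1, 0]))
```

A model whose frame predictions flicker within a video raises the variance term even when its average prediction is correct, which is the instability that frame-wise accuracy alone does not reveal.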

Abstract

This paper presents a novel perspective for enhancing anti-spoofing performance in zero-shot data domain generalization. Unlike traditional image classification tasks, face anti-spoofing datasets display unique generalization characteristics, which call for new approaches to zero-shot data domain generalization. Going one step beyond previous frame-wise spoofing prediction, we introduce a metric calculation that aggregates frame-level probabilities into a video-wise prediction, addressing the gap between reported frame-wise accuracy and the instability observed in real-world use cases. This approach enables the quantification of bias and variance in model predictions, offering a more refined analysis of model generalization. Our investigation reveals that simply scaling up the backbone does not inherently reduce this instability, leading us to propose an ensembled backbone method from a Bayesian perspective. The probabilistically ensembled backbone improves both model robustness, as measured by the proposed metrics, and spoofing accuracy. It also provides uncertainty estimates, which enable enhanced sampling during training and contribute to generalization across new datasets. We evaluate the proposed method on the benchmark OMIC dataset and on the public CelebA-Spoof and SiW-Mv2 datasets. Our final model outperforms existing state-of-the-art methods across these datasets, with improvements in the Bias, Variance, HTER, and AUC metrics.
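The uncertainty estimate mentioned in the abstract can be illustrated with a toy Monte Carlo dropout sketch: dropout stays active at prediction time, and the spread across repeated stochastic forward passes serves as an uncertainty signal. The logistic model, the function name `mc_dropout_predict`, and all parameter values below are hypothetical stand-ins chosen for illustration; this is not the ECLIPS architecture or the authors' sampling procedure.

```python
import math
import random


def mc_dropout_predict(features, weights, bias=0.0, drop_rate=0.5,
                       passes=100, seed=0):
    """Toy Monte Carlo dropout on a logistic model (illustrative only).

    Returns the mean and the variance of the predicted probability over
    `passes` stochastic forward passes with input dropout kept active.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    keep = 1.0 - drop_rate

    probs = []
    for _ in range(passes):
        # Sample a fresh dropout mask on every pass; kept inputs are
        # rescaled by 1/keep (inverted dropout).
        logit = bias + sum(
            w * x / keep
            for w, x in zip(weights, features)
            if rng.random() < keep
        )
        probs.append(1.0 / (1.0 + math.exp(-logit)))

    mean = sum(probs) / passes
    variance = sum((p - mean) ** 2 for p in probs) / passes
    return mean, variance


if __name__ == "__main__":
    mean, var = mc_dropout_predict([1.0, 2.0], [3.0, -3.0], passes=200, seed=1)
    print(f"mean probability = {mean:.3f}, predictive variance = {var:.3f}")
```

In a sketch like this, samples with high predictive variance are the ones a training loop might choose to revisit more often, which is one plausible reading of the "enhanced sampling" described above.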

Paper Structure (22 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: A sequence of five frames extracted from a video in the CASIA dataset [zhang2012face]. The frames show a range of facial expressions and pose variations.
  • Figure 2: Architecture of the proposed FAS model ECLIPS. (a) ECLIPS during training, which uses only a Visual Encoder. (b) The Standard variant, a dual-stream model that integrates textual information with visual features. (c) ECLIPS at inference, which uses only a Visual Encoder.
  • Figure 3: Scatter plot of Bias versus HTER for representative FAS models. ECLIPS is highlighted in the bottom-left corner, indicating its superior accuracy and generalization ability in FAS.