Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

Guosheng Zhang; Keyao Wang; Haixiao Yue; Ajian Liu; Gang Zhang; Kun Yao; Errui Ding; Jingdong Wang

TL;DR

The paper introduces Interpretable Face Anti-Spoofing (I-FAS), a multimodal large language model (MLLM) framework that recasts the FAS task as an interpretable visual question answering (VQA) problem. It also proposes a Lopsided Language Model (L-LM) loss that computes the judgment and interpretation losses separately and prioritizes optimizing the former.

Abstract

Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
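
The idea of separating the judgment loss from the interpretation loss can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the per-part averaging, and the weighting `loss_judgment + alpha * loss_interpretation`, where `alpha` down-weights the (possibly noisy) caption tokens. It should not be read as the authors' definition of the L-LM loss.

```python
import math


def token_nll(probs, target):
    """Negative log-likelihood of the target token under one predicted distribution."""
    return -math.log(probs[target])


def mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def lopsided_lm_loss(token_probs, targets, is_judgment, alpha=0.5):
    """Hypothetical lopsided loss: judgment and interpretation tokens are
    averaged separately, then combined as loss_j + alpha * loss_i.

    token_probs: one probability distribution (list of floats) per answer token
    targets:     index of the ground-truth token at each position
    is_judgment: True for tokens of the real/spoof judgment,
                 False for tokens of the natural-language interpretation
    alpha:       assumed weight on the interpretation part (alpha < 1
                 prioritizes the judgment part)
    """
    judgment, interpretation = [], []
    for probs, target, flag in zip(token_probs, targets, is_judgment):
        bucket = judgment if flag else interpretation
        bucket.append(token_nll(probs, target))
    loss_j = mean(judgment)
    loss_i = mean(interpretation)
    return loss_j + alpha * loss_i, loss_j, loss_i


# Toy example: one judgment token followed by one interpretation token.
total, loss_j, loss_i = lopsided_lm_loss(
    token_probs=[[0.5, 0.5], [0.25, 0.75]],
    targets=[0, 0],
    is_judgment=[True, False],
    alpha=0.5,
)
```

Under this assumed form, setting `alpha` to 0 trains on the judgment tokens only, while larger values give the interpretation tokens more influence.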

Paper Structure (29 sections, 4 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Face Anti-Spoofing
Vision-Language Models
Methodology
Spoof-aware Captioning and Filtering
Interpretative Instruction Tuning
Revisiting MLLMs
Globally Aware Connector
Training with Lopsided LM Loss
Experiments
Experimental Setup
Databases, Protocols, and Evaluation Metrics
Implementation Details
Comparison Results
...and 14 more sections

Figures (10)

Figure 1: Comparison of performance under a challenging One to Eleven benchmark, where training is restricted to a single source domain (CelebA-Spoof) while testing across 11 target domains. The graph illustrates the notable superiority of our method (red point) compared to existing methods under the condition of a limited source domain.
Figure 2: Overview of the proposed Interpretable Face Anti-Spoofing (I-FAS) framework. The leftmost section illustrates the process of our proposed Spoof-aware Captioning and Filtering (SCF) strategy. The central section details the model architecture, which includes a frozen visual encoder, a pre-trained language model (LLM), and the Globally Aware Connector (GAC). The rightmost section presents a schematic representation of the Lopsided Language Model (L-LM) loss.
Figure 3: Illustration of image-caption pairs from spoof samples. The captions $T_{F}$ and $T_{S}$ are generated by the general captioner $C_{G}$ and the spoof-aware captioner $C_{S}$, respectively. Keywords that help identify spoof cues are highlighted in red within the captions.
Figure 4: The output responses of I-FAS on images from the SIW-M-V2 and Rose-Youtu datasets under the unified question: "Is this photo of a real person?". A green box marks a real person, and a red box marks a spoof sample.
Figure 5: Left: Ablation analysis of the hyperparameter $\alpha$ of the lopsided LM loss in Protocol 2. Right: Visualization of loss convergence behavior with and without the lopsided LM loss.
...and 5 more figures
