SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models

Yichen Shi; Yuhao Gao; Yingxin Lai; Hongyang Wang; Jun Feng; Lei He; Jun Wan; Changsheng Chen; Zitong Yu; Xiaochun Cao

TL;DR

SHIELD establishes a dedicated benchmark to probe multimodal large language models (MLLMs) on face security tasks, evaluating true/false and multiple-choice reasoning across face anti-spoofing (FAS) and face forgery detection with diverse modalities. It introduces the multi-attribute chain of thought (MA-COT) framework to enrich attribute-based reasoning and demonstrates, through extensive cross-model experiments, that MLLMs have notable potential but exhibit modality- and prompt-dependent limitations. The study highlights the impact of prompt design, multimodal cues, and task-specific fine-tuning, and advocates richer datasets and metrics to advance robust, interpretable face security solutions. Collectively, SHIELD provides a structured, extensible platform for advancing MLLMs in real-world face authentication security tasks and suggests concrete directions for dataset growth, evaluation richness, and cross-domain collaboration.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks, capitalizing on their visual semantic comprehension and reasoning capabilities. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. In this paper, we introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection. Specifically, we design true/false and multiple-choice questions to assess MLLM performance on multimodal face data across two tasks. For the face anti-spoofing task, we evaluate three modalities (i.e., RGB, infrared, and depth) under six attack types. For the face forgery detection task, we evaluate GAN-based and diffusion-based data, incorporating visual and acoustic modalities. We conduct zero-shot and few-shot evaluations in standard and chain of thought (COT) settings. Additionally, we propose a novel multi-attribute chain of thought (MA-COT) paradigm for describing and judging various task-specific and task-irrelevant attributes of face images. The findings of this study demonstrate that MLLMs exhibit strong potential for addressing the challenges associated with the security of facial recognition technology applications.
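The evaluation grid described in the abstract (two question types, zero-shot and few-shot, standard and COT) can be sketched as a small prompt builder. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, and the default of four candidate images are all assumptions made for this example.

```python
from itertools import product

QUESTION_TYPES = ("true_false", "multiple_choice")
SHOT_MODES = ("zero_shot", "few_shot")
REASONING_MODES = ("standard", "cot")


def build_prompt(question_type, reasoning, demos=(), num_options=4):
    """Build the text part of a prompt for one evaluation setting.

    demos is a sequence of (description, answer) pairs used as few-shot
    examples; an empty sequence yields a zero-shot prompt.
    All wording here is hypothetical, not taken from the paper.
    """
    parts = []
    # Few-shot setting: prepend the solved demonstrations.
    for i, (description, answer) in enumerate(demos, start=1):
        parts.append(f"Example {i}: {description} Answer: {answer}")

    if question_type == "true_false":
        parts.append("Is this image a real face? Answer yes or no.")
    elif question_type == "multiple_choice":
        parts.append(
            f"Which of the following {num_options} images is a real face? "
            "Answer with the image number."
        )
    else:
        raise ValueError(f"unknown question type: {question_type}")

    # COT setting: ask for a description before the final judgment.
    if reasoning == "cot":
        parts.append("Describe the image first, then give your answer.")
    return "\n".join(parts)


def all_settings():
    """Enumerate every (question type, shot mode, reasoning mode) combination."""
    return list(product(QUESTION_TYPES, SHOT_MODES, REASONING_MODES))
```

Calling `build_prompt("true_false", "cot", demos=[("A printed photo held in front of the camera.", "no")])` gives a few-shot COT prompt, while omitting `demos` gives the zero-shot variant of the same question.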

Paper Structure (29 sections, 1 equation, 13 figures, 28 tables)

Introduction
Related Work
Face Anti-Spoofing
Face Forgery Detection
Multimodal Large Language Model
Existing MLLM Benchmark
Task Design
SHIELD
Data Collection
Task Design
MA-COT
Experiments
Experimental Setup
Evaluation Metrics
Results on Face Anti-Spoofing
...and 14 more sections

Figures (13)

Figure 1: Performance of various multimodal large language models on (a) true/false and (b) multiple-choice questions across different types of attacks. The term "bona fide" denotes a genuine face image. Print refers to a printed photograph, and replay refers to a replayed video. The results demonstrate the models' superior ability to distinguish between physical and digital attacks. In (a), the larger the area of each colored polygon, the better the performance; Qwen-VL and mPLUG-owl outperform the other models. In (b), GPT4V shows the best performance. Brown represents the face anti-spoofing task, deep blue represents the face forgery detection task, and orange represents the joint task. AVG: average.
Figure 2: Examples of our collected datasets. The images are sourced from the WMCA (george2019biometric) and FF++ (rossler2019faceforensics) datasets.
Figure 3: Pipeline of task design. The ellipses indicate that the structures are consistent with the task design framework shown above. COT: chain of thought.
Figure 4: The MA-COT process. This process extracts the key attributes relevant to each task and inputs them, along with the face images under evaluation, into MLLMs; all attributes used are listed in Table \ref{Attributes-set}. This approach aims to guide the MLLM to analyze the images from multiple perspectives, thereby identifying potential clues of attacks and determining whether the images are of real faces. The illustration provides examples of key attribute extraction and its application scenarios in separate FAS, separate face forgery detection, and unified face spoof & forgery detection. The images are sourced from the WMCA (george2019biometric) and FF++ (rossler2019faceforensics) datasets. MA-COT: multi-attribute chain of thought.
Figure 5: Prompt design. The diagram shows a matrix of test results used to select prompts. The candidate prompts are on the left, and the test cases used to evaluate them run along the top: real face, rigid mask attack, replay attack, paper mask attack, and flexible mask attack. Responses from GPT4V and Gemini are included. Yellow (red) highlights correct (incorrect) responses. The images are sourced from the WMCA (george2019biometric) and FF++ (rossler2019faceforensics) datasets.
...and 8 more figures
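
The MA-COT process in the Figure 4 caption (collect task attributes, ask the model to describe each one, then ask for a final judgment) can be sketched as a prompt-assembly step. This is only an illustration under stated assumptions: the function name, the prompt wording, and the attribute names in the usage note are hypothetical and are not the attribute set used in the paper.

```python
def build_ma_cot_prompt(task_specific, task_irrelevant):
    """Assemble a multi-attribute chain-of-thought style prompt.

    task_specific and task_irrelevant are lists of attribute names.
    The wording below is a hypothetical sketch, not the paper's prompt.
    """
    # Merge both attribute groups, dropping duplicates but keeping order.
    attributes = list(dict.fromkeys([*task_specific, *task_irrelevant]))

    lines = ["Analyze the face image attribute by attribute."]
    for i, attribute in enumerate(attributes, start=1):
        lines.append(
            f"{i}. Describe the {attribute} and state whether it suggests an attack."
        )
    # The final judgment is requested only after every attribute is described.
    lines.append(
        "Based on the descriptions above, is this a real face? Answer yes or no."
    )
    return "\n".join(lines)
```

For example, `build_ma_cot_prompt(["skin texture", "image border"], ["lighting"])` produces three numbered description steps followed by the final yes/no question; the resulting text would be sent to the model together with the image under evaluation.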
