Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing

Honglu Zhang, Zhiqin Fang, Ningning Zhao, Saihui Hou, Long Ma, Renwang Pei, Zhaofeng He

TL;DR

This paper introduces FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for Face Anti-Spoofing (FAS), together with a CoT-Enhanced Progressive Learning (CEPL) strategy that better leverages the CoT data and boosts model performance on FAS tasks.

Abstract

Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image-text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision-language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.

Paper Structure

This paper contains 57 sections, 1 equation, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Example of the FaceCoT, illustrating the six CoT components: caption, facial description, facial attributes, reasoning, spoofing description, and conclusion.
  • Figure 2: (a) The data types in FaceCoT. It comprises 3 major spoofing types and 14 subtypes. (b) Comparison results with state-of-the-art methods on 11 benchmark FAS datasets. Our method achieves the highest performance on every evaluation set.
  • Figure 3: This diagram illustrates the entire process of data annotation and expansion for the FaceCoT dataset. (a) Data Annotation: This step shows the annotation process of FaceCoT-Gold100K. (b) Data Expansion: This phase shows the annotation process of FaceCoT-Silver982K. (c) RL in FaceCoT: This part shows the RL in the training of the FAS caption model.
  • Figure 4: Our proposed CoT-Enhanced Progressive Learning framework consists of two stages: (a) Visual Enhancement Pre-training fine-tunes on CoT annotations to strengthen visual perception and representation; (b) Multi-task Joint Training, which inherits the vision encoder learned in Stage-1 and jointly optimizes both CoT generation and binary classification.
  • Figure 5: Outputs of different FAS methods: (a) a traditional binary classification method; (b) other MLLM-based methods, such as I-FAS, which can answer classification questions and provide simple descriptions; (c) our method, which not only answers classification questions but also provides a systematic reasoning analysis.
  • ...and 9 more figures
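
To make the six CoT components named in the Figure 1 caption concrete, the following is a minimal sketch of how one annotated sample might be represented and flattened into a single reasoning string. The class name `FaceCoTSample`, the field names, the `to_cot_text` helper, and the example values are all hypothetical illustrations; the actual storage format of the FaceCoT dataset is not described in this section.

```python
from dataclasses import dataclass, fields


@dataclass
class FaceCoTSample:
    """Hypothetical container for the six CoT components listed in Figure 1."""
    caption: str
    facial_description: str
    facial_attributes: str
    reasoning: str
    spoofing_description: str
    conclusion: str

    def to_cot_text(self) -> str:
        # Join the components in the order given in the Figure 1 caption,
        # one labeled line per component.
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"{label}: {getattr(self, f.name)}")
        return "\n".join(lines)


# Invented placeholder values, for illustration only (not taken from the dataset).
sample = FaceCoTSample(
    caption="A person holds a printed photo in front of the camera.",
    facial_description="The face appears flat with uniform lighting.",
    facial_attributes="Adult, neutral expression.",
    reasoning="Paper edges and a lack of depth cues are visible.",
    spoofing_description="Consistent with a print attack.",
    conclusion="spoof",
)
cot_text = sample.to_cot_text()
print(cot_text)
```

Keeping the components as separate fields rather than one free-text blob makes it straightforward to train on the full reasoning chain or on the final conclusion alone, which is the kind of split that a joint CoT-generation and binary-classification setup, as described in the Figure 4 caption, would need.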