Benchmarking Joint Face Spoofing and Forgery Detection with Visual and Physiological Cues

Zitong Yu, Rizhao Cai, Zhi Li, Wenhan Yang, Jingang Shi, Alex C. Kot

TL;DR

This work tackles the generalization gap in face attack detection by proposing the first joint benchmark for face spoofing and forgery detection that exploits both visual appearance and physiological rPPG cues. It introduces a two-branch physiological network to strengthen rPPG periodicity cues and a weighted normalization fusion to align appearance and rPPG features before fusion, enabling effective multimodal and multi-task learning. Across extensive experiments, joint training on both tasks improves cross-domain generalization for many models and fusion strategies, though results vary with architecture and task setup. The benchmark and findings aim to unify FAS and deepfake detection research and spur development of more robust, generalizable defenses for real-world face security systems.

Abstract

Face anti-spoofing (FAS) and face forgery detection play vital roles in securing face biometric systems against presentation attacks (PAs) and malicious digital manipulation (e.g., deepfakes). Despite promising performance with large-scale data and powerful deep models, the generalization of existing approaches remains an open issue. Most recent approaches focus on 1) unimodal visual appearance or physiological (i.e., remote photoplethysmography (rPPG)) cues; and 2) separate feature representations for FAS or face forgery detection. On the one hand, unimodal appearance and rPPG features are vulnerable to high-fidelity 3D face mask attacks and video replay attacks, respectively, which motivates us to design reliable multi-modal fusion mechanisms for generalized face attack detection. On the other hand, FAS and face forgery detection share rich common features (e.g., periodic rPPG rhythms and vanilla appearance for bonafide samples), which supports designing a joint FAS and face forgery detection system in a multi-task learning fashion. In this paper, we establish the first joint face spoofing and forgery detection benchmark using both visual appearance and physiological rPPG cues. To enhance rPPG periodicity discrimination, we design a two-branch physiological network that takes both the facial spatio-temporal rPPG signal map and its continuous wavelet transformed counterpart as inputs. To mitigate modality bias and improve fusion efficacy, we apply weighted batch and layer normalization to both appearance and rPPG features before multi-modal fusion. We find that the generalization capacity of both unimodal (appearance or rPPG) and multi-modal (appearance+rPPG) models can be clearly improved via joint training on these two tasks. We hope this new benchmark will facilitate future research in both the FAS and deepfake detection communities.
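
The weighted normalization step can be illustrated with a minimal sketch. The abstract only states that batch and layer normalization are weighted before fusion, so the convex blend theta * BN(x) + (1 - theta) * LN(x), the concatenation-based fusion, and all function names below are assumptions made for illustration rather than the authors' implementation. Learnable affine parameters are omitted.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability


def _standardize(values):
    """Scale a list of floats to zero mean and (approximately) unit variance."""
    mu = fmean(values)
    sigma = math.sqrt(pvariance(values, mu) + EPS)
    return [(v - mu) / sigma for v in values]


def batch_norm(features):
    """Normalize each feature dimension across the batch (rows are samples)."""
    columns = [_standardize(list(col)) for col in zip(*features)]
    return [list(row) for row in zip(*columns)]


def layer_norm(features):
    """Normalize each sample across its own feature dimensions."""
    return [_standardize(row) for row in features]


def weighted_norm(features, theta):
    """Assumed blend: theta * BN(x) + (1 - theta) * LN(x)."""
    bn, ln = batch_norm(features), layer_norm(features)
    return [
        [theta * b + (1.0 - theta) * l for b, l in zip(row_bn, row_ln)]
        for row_bn, row_ln in zip(bn, ln)
    ]


def fuse(appearance, rppg, theta=0.5):
    """Normalize each modality separately, then concatenate per sample."""
    a = weighted_norm(appearance, theta)
    r = weighted_norm(rppg, theta)
    return [row_a + row_r for row_a, row_r in zip(a, r)]


# Toy batch of two samples: appearance features sit on a much larger
# numeric scale than rPPG features, which is the bias being mitigated.
appearance = [[120.0, 80.0, 100.0], [90.0, 110.0, 130.0]]
rppg = [[0.01, 0.03], [0.02, 0.00]]
fused = fuse(appearance, rppg, theta=0.5)
```

After normalization both modalities occupy a comparable numeric range, so neither dominates the fused vector purely because of its scale. Setting theta to 1 or 0 recovers pure batch or pure layer normalization under this assumed form.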

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Visualization of the modality (visual appearance and physiological rPPG) and task (face spoofing and forgery detection) matrix. Columns from left to right: 1) appearance models with RGB face inputs; 2) rPPG models with facial rPPG signal inputs; and 3) appearance+rPPG models with both RGB face and facial rPPG inputs. Rows from top to bottom: 1) separate face spoofing detection with bonafide/spoof for training; 2) separate face forgery detection with bonafide/deepfakes for training; and 3) joint face spoofing and forgery detection with bonafide/spoof/deepfakes for training. Best viewed in color.
  • Figure 2: Different head settings for multi-task learning with (a) a shared single head for binary classification; (b) a shared single head for 3-class classification; and (c) separate heads for binary classification. Task #1 and #2 indicate face spoofing and forgery detection, respectively.
  • Figure 3: The framework of the two-branch physiological network. The time-domain facial MSTmap and its wavelet-transformed time-frequency representation, WaveletMap, are used as the inputs to the two branches.
  • Figure 4: Representative visual samples (RGB faces in the first and fourth rows) and their rPPG maps (MSTmap [niu2020video] in the second and fifth rows, wavelet maps in the third and sixth rows) on nine benchmark datasets.
  • Figure 5: Ablation studies of (a)-(d) the joint training architectures with different shared blocks using the 2-head, 2-class setting; and (e)-(f) $\theta$ in the weighted normalization fusion.
  • ...and 1 more figure
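
The three head settings described in the Figure 2 caption differ mainly in how training labels are assigned. The sketch below shows one plausible label mapping for each setting. The class indices, the `Kind` enum, the function names, and the rule that a sample is routed to the head of the task its dataset belongs to are all illustrative assumptions, not details taken from the paper.

```python
from enum import Enum


class Kind(Enum):
    BONAFIDE = "bonafide"
    SPOOF = "spoof"
    DEEPFAKE = "deepfake"


def shared_binary_target(kind):
    """(a) One shared head, 2 classes: 0 = bonafide, 1 = any attack."""
    return 0 if kind is Kind.BONAFIDE else 1


def shared_three_class_target(kind):
    """(b) One shared head, 3 classes: bonafide / spoof / deepfake."""
    return {Kind.BONAFIDE: 0, Kind.SPOOF: 1, Kind.DEEPFAKE: 2}[kind]


def separate_heads_target(kind, task):
    """(c) Two binary heads; returns (head_index, label).

    task is 1 for face spoofing data and 2 for face forgery data,
    following the Task #1 / #2 naming in the caption. A bonafide sample
    goes to the head of whichever task its dataset belongs to (assumed).
    """
    if task not in (1, 2):
        raise ValueError("task must be 1 (spoofing) or 2 (forgery)")
    if kind is Kind.SPOOF and task != 1:
        raise ValueError("spoof samples belong to task 1")
    if kind is Kind.DEEPFAKE and task != 2:
        raise ValueError("deepfake samples belong to task 2")
    return task - 1, (0 if kind is Kind.BONAFIDE else 1)


# Example: the same deepfake sample under the three settings.
sample = Kind.DEEPFAKE
targets = (
    shared_binary_target(sample),
    shared_three_class_target(sample),
    separate_heads_target(sample, task=2),
)
```

Under these assumptions, setting (a) merges both attack types into one class, (b) keeps them distinct within a single classifier, and (c) keeps each task binary while sharing only the backbone.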