G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li

TL;DR

The paper addresses video-based face anti-spoofing by integrating photometric cues from facial images with dynamic motion cues from facial landmarks. It introduces the Graph Guided Video Vision Transformer (G$^2$V$^2$former), a two-stream framework that factorizes spatial and temporal attention, uses topology-aware spatial attention and a Kronecker temporal attention to capture broad temporal dynamics, and leverages landmark motion to guide pixel-level motion capture via graph-temporal guidance. A photometric consistency loss $L_{\text{PCL}}$ enforces frame-to-frame similarity while treating live and spoof samples differently to reflect photometric homogeneity vs. heterogeneity. Through two-stage self-supervised pretraining and rigorous cross-dataset/cross-type experiments across nine datasets, the method demonstrates strong generalization, outperforming many state-of-the-art approaches and showing robust performance against high-photorealism 3D masks. Overall, this work advances video-based FAS by jointly exploiting photometric and motion cues with graph-structured guidance for improved detection.

Abstract

In videos containing spoofed faces, we may uncover spoofing evidence from photometric abnormalities, dynamic abnormalities, or a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to incorrect judgments, especially when a sample is easy to distinguish in terms of dynamics but hard to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, motivated by the observation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.
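
The factorization of attention into space and time can be read as applying two masks to joint spatiotemporal attention over T frames with N tokens each. The sketch below is illustrative only and is not the authors' implementation. It assumes frame-major token ordering (token index = frame * N + patch), that spatial attention restricts each token to its own frame, and that the wider-receptive-field temporal attention lets each token attend to every token in the other frames. That last point is an assumption made here for illustration; the paper's exact Kronecker temporal mask may differ. The names `kron`, `spatial_mask`, and `cross_frame_mask` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def _ones(n):
    """An n x n matrix of ones (all tokens of one frame pair may interact)."""
    return [[1] * n for _ in range(n)]


def spatial_mask(num_frames, tokens_per_frame):
    """Mask for spatial attention: 1 where two tokens share a frame.

    Built as I_T (x) 1_{N x N}, i.e. a block-diagonal matrix.
    """
    eye = [[1 if s == t else 0 for t in range(num_frames)]
           for s in range(num_frames)]
    return kron(eye, _ones(tokens_per_frame))


def cross_frame_mask(num_frames, tokens_per_frame):
    """ASSUMED temporal mask: 1 where two tokens lie in different frames.

    Built as (1_{T x T} - I_T) (x) 1_{N x N}. This is the complement of
    spatial_mask, which is one way to obtain spatiotemporal complementarity.
    """
    off_diag = [[0 if s == t else 1 for t in range(num_frames)]
                for s in range(num_frames)]
    return kron(off_diag, _ones(tokens_per_frame))
```

Under these assumptions the two masks never overlap and together cover every token pair, so a block that applies both sees the same pairs as joint attention while each attention call stays restricted.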

Paper Structure (17 sections, 20 equations, 7 figures, 11 tables)


Figures (7)

  • Figure 1: Most common methods in FAS focus on frame-level spoofing representation. In contrast, our method aims to fuse photometric and dynamic spoofing clues. When the testing sample shows abnormalities in any dimension (space or time), it should be detected as a spoof.
  • Figure 2: (a) The topology-aware spatial attention. Connections between two nodes will be blocked (masked in attention matrix) if they are too far apart in the topology. (b) Various visual spatiotemporal attention. (c) Common spatial attention can be seen as applying a mask to joint attention. (d) Kronecker temporal attention can also be obtained by employing a tailored mask to joint attention. Spatial attention and Kronecker temporal attention exhibit spatiotemporal complementarity.
  • Figure 3: The architecture of the Graph Guided Video Vision Transformer. A facial video clip is fed into the vision stream and the corresponding facial landmarks into the graph stream. The visual spatial attention is guided by the photometric consistency loss, while the visual temporal attention is guided by the graphic temporal attention via a scatter-and-add operation. We concatenate the head tokens of both streams for classification and optimize the model with a cross-entropy loss.
  • Figure 4: Graph-guided vision temporal attention. First, a scatter operation aligns the shape of the graphic temporal attention matrix with that of the visual temporal attention matrix; the two matrices are then added together.
  • Figure 5: Feature distribution visualized by t-SNE. (Left) Cross-domain scenario, (Right) Cross-domain and unseen attack scenario.
  • ...and 2 more figures
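
The topology-aware spatial attention in Figure 2(a) blocks connections between nodes that are too far apart in the topology. The following is a minimal sketch of how such a mask could be built, not the paper's implementation. It assumes that "far apart" means graph hop distance and uses a hypothetical threshold `max_hops`; neither is specified in the captions above. The function names are also hypothetical, and any edge list passed in is up to the caller (the actual landmark topology is not given here).

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs hop distance on an undirected graph via BFS.

    Unreachable pairs are reported as None.
    """
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def topology_mask(num_nodes, edges, max_hops):
    """Boolean attention mask: True where attention is allowed.

    A pair (i, j) is allowed only if j is reachable from i within
    max_hops edges (ASSUMED criterion); every node may attend to itself.
    """
    dist = hop_distances(num_nodes, edges)
    return [[d is not None and d <= max_hops for d in row] for row in dist]
```

In an attention layer, entries where the mask is False would typically have their logits set to negative infinity before the softmax, so that blocked node pairs receive zero attention weight.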