G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing
Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li
TL;DR
The paper addresses video-based face anti-spoofing (FAS) by integrating photometric cues from facial images with dynamic motion cues from facial landmarks. It introduces the Graph Guided Video Vision Transformer (G$^2$V$^2$former), a two-stream framework that factorizes attention into spatial and temporal components. It uses topology-aware spatial attention together with a Kronecker temporal attention that captures broad temporal dynamics, and it leverages landmark motion to guide pixel-level motion capture via graph-temporal guidance. A photometric consistency loss $L_{\text{PCL}}$ enforces frame-to-frame similarity while treating live and spoof samples differently, reflecting photometric homogeneity versus heterogeneity. With two-stage self-supervised pretraining and cross-dataset and cross-type experiments across nine datasets, the method demonstrates strong generalization, outperforming many state-of-the-art approaches and remaining robust against high-photorealism 3D masks. Overall, the work advances video-based FAS by jointly exploiting photometric and motion cues with graph-structured guidance.
Abstract
In videos containing spoofed faces, spoofing evidence may be uncovered through photometric abnormality, dynamic abnormality, or a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario. However, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to incorrect judgments, especially when a spoof is easily distinguishable in terms of dynamics but hard to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention, called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, motivated by the observation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.
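
To make the idea of factorizing attention into space and time concrete, the following is a minimal NumPy sketch of generic factorized space-time self-attention. It is an illustration under stated assumptions, not the authors' implementation: the function names, the token layout `(T, N, D)`, the identity query/key/value projections, and the sequential spatial-then-temporal composition are all hypothetical choices made here. The Kronecker temporal attention, the topology-aware spatial attention, and the spatiotemporal fusion block of G$^2$V$^2$former are not reproduced.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head self-attention over the second-to-last axis.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), which keeps the sketch short.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def factorized_space_time_attention(tokens):
    """Apply spatial attention per frame, then temporal attention per patch.

    tokens: array of shape (T, N, D) with T frames, N patch tokens per
    frame, and D channels (a hypothetical layout chosen for this sketch).
    """
    # Spatial step: each frame attends over its own N tokens.
    spatial = self_attention(tokens)              # (T, N, D)
    # Temporal step: each patch position attends over the T frames.
    per_patch = np.swapaxes(spatial, 0, 1)        # (N, T, D)
    temporal = self_attention(per_patch)          # (N, T, D)
    return np.swapaxes(temporal, 0, 1)            # (T, N, D)


rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(4, 9, 16))        # 4 frames, 9 patches, 16 channels
out = factorized_space_time_attention(video_tokens)
print(out.shape)
```

The usual motivation for this kind of factorization is cost: joint attention over all $T \cdot N$ tokens builds a score matrix with $(TN)^2$ entries, whereas the factorized form builds $T$ matrices of size $N \times N$ and $N$ matrices of size $T \times T$.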
