Face Presentation Attack Detection via Content-Adaptive Spatial Operators

Shujaat Khan

TL;DR

CASO-PAD is an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues. It offers a practical pathway for robust, on-device FacePAD with mobile-class compute and without auxiliary sensors or temporal stacks.

Abstract

Face presentation attack detection (FacePAD) is critical for securing facial authentication against print, replay, and mask-based spoofing. This paper proposes CASO-PAD, an RGB-only, single-frame model that enhances MobileNetV3 with content-adaptive spatial operators (involution) to better capture localized spoof cues. Unlike spatially shared convolution kernels, the proposed operator generates location-specific, channel-shared kernels conditioned on the input, improving spatial selectivity with minimal overhead. CASO-PAD remains lightweight (3.6M parameters; 0.64 GFLOPs at $256\times256$) and is trained end-to-end using a standard binary cross-entropy objective. Extensive experiments on Replay-Attack, Replay-Mobile, ROSE-Youtu, and OULU-NPU demonstrate strong performance, achieving 100/100/98.9/99.7% test accuracy, AUC of 1.00/1.00/0.9995/0.9999, and HTER of 0.00/0.00/0.82/0.44%, respectively. On the large-scale SiW-Mv2 Protocol-1 benchmark, CASO-PAD further attains 95.45% accuracy with 3.11% HTER and 3.13% EER, indicating improved robustness under diverse real-world attacks. Ablation studies show that placing the adaptive operator near the network head and using moderate group sharing yields the best accuracy-efficiency balance. Overall, CASO-PAD provides a practical pathway for robust, on-device FacePAD with mobile-class compute and without auxiliary sensors or temporal stacks.
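
To make the operator described in the abstract concrete, here is a minimal pure-Python sketch of an involution-style layer. It is an illustration under stated assumptions, not the paper's implementation: the kernel generator is reduced to a single linear map `w` (this excerpt does not specify the generator), borders use zero padding, and tensors are plain nested lists. The name `involution` and its parameters are hypothetical.

```python
def involution(x, w, k=3, groups=1):
    """Involution-style operator on a small feature map (illustrative sketch).

    x: feature map as nested lists with shape H x W x C.
    w: kernel-generator weights with shape (groups * k * k) x C.
       Assumption: a single linear map stands in for the kernel generator.
    Returns a new H x W x C feature map.
    """
    h, wd, c = len(x), len(x[0]), len(x[0][0])
    assert c % groups == 0 and len(w) == groups * k * k
    per_group = c // groups
    pad = k // 2
    out = [[[0.0] * c for _ in range(wd)] for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            # Location-specific: the kernel is generated from the feature
            # vector at (i, j), so every position gets its own k x k kernel.
            kern = [sum(row[ch] * x[i][j][ch] for ch in range(c)) for row in w]
            for ch in range(c):
                # Channel-shared: all channels in a group reuse one kernel.
                g = ch // per_group
                acc = 0.0
                for u in range(k):
                    for v in range(k):
                        ii, jj = i + u - pad, j + v - pad
                        if 0 <= ii < h and 0 <= jj < wd:  # zero padding
                            acc += kern[(g * k + u) * k + v] * x[ii][jj][ch]
                out[i][j][ch] = acc
    return out


# Toy 3 x 3 map with 2 channels: channel 0 is constant, channel 1 varies.
x = [[[1.0, float(i * 3 + j)] for j in range(3)] for i in range(3)]

# Generator whose centre tap reads channel 1, so the kernel depends on content.
w_adaptive = [[0.0, 0.0] for _ in range(9)]
w_adaptive[4] = [0.0, 1.0]
y = involution(x, w_adaptive, k=3, groups=1)
```

In this toy setup each output pixel is scaled by that pixel's own channel-1 value, which a fixed convolution kernel cannot do. With `groups=1` all channels share one kernel per location; a larger `groups` gives each channel group its own kernel, which is the kind of sharing trade-off the ablation on group sharing refers to.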

Paper Structure (49 sections, 6 equations, 8 figures, 13 tables, 1 algorithm)

This paper contains 49 sections, 6 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Sample frames from authentic and spoofed videos in (a) RA, (b) RM, (c) OULU-NPU, (d) RY, and (e) SiW-Mv2 datasets. Genuine samples appear in the top row, while the bottom row shows different attack types.
  • Figure 2: Illustration of the adaptive center-cropping process used in the pre-processing pipeline. This operation ensures consistent framing and preserves the original aspect ratio across all samples.
  • Figure 3: Schematic diagram of the proposed content-adaptive spatial operator-based deep learning network.
  • Figure 4: Kernel audit visualization (normalized to $[0,1]$). Left: Mean kernel showing directional polarity contrast, resembling an edge-detecting filter. Right: Mean energy distribution, radially compact and indicative of high-pass/edge-emphasizing behavior.
  • Figure 5: Kernel audit overlays on the test set. (a) Real faces show higher directional consistency (anisotropy), while (b) attack samples exhibit excessive sharpness (higher HF/LF) but weaker organization.
  • ...and 3 more figures