Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection

Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan

TL;DR

This work tackles the challenge of deploying robust face presentation attack detection under rehearsal-free incremental learning, constrained by privacy regulations that prevent storing past data. It introduces SVLP-IL, a CLIP-based framework that steers vision-language pre-trained models via Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC) to balance plasticity and stability without data replay. MAP provides domain-specific and universal cues through visual and textual prompts, while SEWC protects critical backbone weights by selectively consolidating past knowledge with a Bayesian-inspired Fisher-based penalty. Extensive experiments across nine PAD benchmarks demonstrate reduced forgetting and strong generalization to unseen domains, offering a practical, privacy-conscious approach to lifelong PAD deployment.

Abstract

Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose SVLP-IL, a VLP-based RF-IL framework that balances stability and plasticity via Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
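To make the idea of selectively preserving critical weights more concrete, the following is a minimal sketch of a Fisher-weighted quadratic penalty restricted to a subset of parameters. It is not the authors' implementation: the diagonal Fisher estimate (mean squared gradient), the top-p selection rule, and the names `fisher_diagonal`, `top_p_mask`, `sewc_penalty`, and `lam` are all assumptions chosen for illustration.

```python
def fisher_diagonal(per_sample_grads):
    """Estimate a diagonal Fisher as the mean squared gradient (assumed estimator)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def top_p_mask(fisher, p):
    """Select the fraction p of parameters with the largest Fisher values (assumed rule)."""
    k = max(1, round(p * len(fisher)))
    ranked = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(fisher))]


def sewc_penalty(theta, theta_old, fisher, mask, lam=1.0):
    """Quadratic EWC-style penalty applied only to the selected parameters."""
    return 0.5 * lam * sum(
        m * f * (t - t0) ** 2
        for m, f, t, t0 in zip(mask, fisher, theta, theta_old)
    )


# Toy example: four parameters, gradients from two samples of a previous domain.
grads = [[1.0, 0.0, 2.0, 0.1], [1.0, 0.0, 0.0, 0.1]]
fisher = fisher_diagonal(grads)
mask = top_p_mask(fisher, p=0.5)          # [1.0, 0.0, 1.0, 0.0]
theta_old = [0.0, 0.0, 0.0, 0.0]          # weights after the previous domain
theta_new = [1.0, 1.0, 1.0, 1.0]          # candidate weights on the new domain
penalty = sewc_penalty(theta_new, theta_old, fisher, mask)  # 1.5
```

In this sketch, only the two parameters with the highest Fisher values are penalized for drifting, so the remaining parameters stay free to adapt to the new domain. In training, a term like this would be added to the task loss on the current domain.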

Paper Structure

This paper contains 32 sections, 21 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of training data access assumptions among Traditional Machine Learning, Experience Replay Incremental Learning (ER-IL), and Rehearsal-Free Incremental Learning (RF-IL). (a) Traditional Machine Learning assumes simultaneous access to data from multiple domains. (b) ER-IL assumes that part of the data from previous domains can be used. (c) In contrast, RF-IL can only access data from one domain at a time, and data from previous domains is not accessible.
  • Figure 2: Overview of the SVLP-IL framework. The framework learns incrementally by balancing adaptation and stability. MAP adapts the model to new domains by learning a structured set of visual and textual prompts that capture both domain-specific and universal spoofing cues. Concurrently, SEWC preserves existing knowledge by protecting critical parameters in the shared backbone from being overwritten. During inference, a lightweight prototype-based router selects the appropriate prompts for a given test image.
  • Figure 3: Long-sequence incremental learning performance in terms of AUC on Protocol-4, which simulates progressively complex attack types. Subplots (MSU-MFSD to WFFD) illustrate the performance trajectory on each source domain as new domains are sequentially learned. The last subplot (CelebA-Spoof) presents the final generalization performance on an unseen diverse attack dataset after completing the 8-stage incremental training.
  • Figure 4: Ablation study of Multi-Aspect Prompting (MAP) components on the long-sequence Protocol-4. We report the average $\Delta m\%$ over the 8 incremental steps. Lower values indicate better retention of past knowledge.
  • Figure 5: Ablation study on the selection ratio $p$ in SEWC. Performance is evaluated in terms of HTER on WFFD (last domain) and MSU-MFSD (first domain) after sequential training on eight domains (Protocol-4).
  • ...and 1 more figure
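The Figure 2 caption mentions a lightweight prototype-based router that selects prompts at inference time. One plausible reading is sketched below under stated assumptions: each domain is summarized by the mean of its image features, and a test feature is routed to the domain whose prototype has the highest cosine similarity. The mean-feature prototype, the cosine criterion, and the names `mean_prototype`, `cosine`, and `route` are hypothetical and not taken from the paper.

```python
import math


def mean_prototype(features):
    """Summarize a domain by the element-wise mean of its feature vectors."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(feature, prototypes):
    """Return the key of the prototype most similar to the test feature."""
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))


# Toy example with two hypothetical domains and 2-D features.
prototypes = {
    "domain_a": mean_prototype([[1.0, 0.0], [0.8, 0.2]]),
    "domain_b": mean_prototype([[0.0, 1.0], [0.2, 0.8]]),
}
selected = route([0.9, 0.1], prototypes)  # "domain_a"
```

Because only one prototype vector per domain is stored, a router of this kind keeps no raw images from earlier domains, which is compatible with the rehearsal-free setting described in Figure 1.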