
Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing

Xun Lin, Shuai Wang, Rizhao Cai, Yizhong Liu, Ying Fu, Zitong Yu, Wenzhong Tang, Alex Kot

TL;DR

This paper tackles the challenge of generalizing multi-modal face anti-spoofing (FAS) to unseen environments by addressing two problems: modality unreliability during cross-modal fusion and modality imbalance during training. It introduces MMDG, a Vision Transformer–based framework with Uncertainty-Guided Cross-Adapters (U-Adapter) to suppress unreliable information and Rebalanced Modality Gradient Modulation (ReGrad) to balance modality convergence, complemented by a Single-Side Prototypical Loss to align domain prototypes. The paper also proposes the first large-scale benchmark for multi-modal domain generalization (DG) in FAS, spanning four datasets (CASIA-CeFA, PADISI-Face, CASIA-SURF, WMCA) and three evaluation protocols, with experiments showing state-of-the-art improvements over existing DG and multi-modal methods. The work provides practical insights for deploying robust multi-modal FAS systems under domain shifts and establishes a challenging benchmark to spur further progress, with code and protocols to be released.

Abstract

Face Anti-Spoofing (FAS) is crucial for securing face recognition systems against presentation attacks. With advancements in sensor manufacturing and multi-modal learning techniques, many multi-modal FAS approaches have emerged. However, they face challenges in generalizing to unseen attacks and deployment conditions. These challenges arise from (1) modality unreliability, where some modality sensors like depth and infrared undergo significant domain shifts in varying environments, leading to the spread of unreliable information during cross-modal feature fusion, and (2) modality imbalance, where training that overly relies on a dominant modality hinders the convergence of the others, reducing effectiveness against attack types that are indistinguishable using the dominant modality alone. To address modality unreliability, we propose the Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected regions within each modality and suppress the impact of unreliable regions on other modalities. For modality imbalance, we propose a Rebalanced Modality Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all modalities by adaptively adjusting their gradients. In addition, we provide the first large-scale benchmark for evaluating multi-modal FAS performance under domain generalization scenarios. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. Source code and protocols will be released on https://github.com/OMGGGGG/mmdg.

Paper Structure (12 sections, 8 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Illustration of DG scenarios in the context of (a) unimodal and (b) multi-modal. (c) DG performance of SSDG on our Protocol 1 (see the benchmark section). Though fed with more modalities, SSDG performs worse in multi-modal scenarios compared to unimodal ones. * denotes our re-implemented multi-modal version.
  • Figure 2: Overall structure of our MMDG. It consists of ViT-based backbones fine-tuned by the proposed U-Adapters with the modality rebalancing strategy called ReGrad. Each modality is designed with a branch for feature extraction and enables feature interaction and mutual complementarity with other modalities. For simplicity, we illustrate the two-modality scenario.
  • Figure 3: (a1)-(a2) Illustration of fine-tuning ViT with the proposed U-Adapters, showcasing the interaction between the RGB (R) and Depth (D) modalities. Note that only the parameters of the U-Adapters are trainable. (b) Uncertainty Estimation Module (UEM) used for recognizing unreliable tokens. (c) Detailed structure of the U-Adapter, which adopts cross-modal fusion and suppresses the interference of unreliable tokens on other modalities. After fusion, discriminative central difference information is integrated for fine-grained spoof representation.
  • Figure 4: Illustration of gradient modulation via the proposed ReGrad in different scenarios: (Row 1) Non-conflicting (a1) and faster modality $j$ (b1) or $i$ (c1). (Row 2) Conflicting (a2) and faster modality $j$ (b2) or $i$ (c2).
  • Figure 5: Ablation results on our U-Adapter. We report average HTER $\downarrow$ and AUC $\uparrow$ on four sub-protocols in Protocol 1.
  • ...and 2 more figures
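
The HTER and AUC metrics named in the Figure 5 caption can be illustrated with a minimal sketch. It uses the standard definitions: HTER is the mean of the false acceptance rate and the false rejection rate at a fixed threshold, and AUC is the probability that a live sample scores higher than a spoof sample. The label convention (1 = live, 0 = spoof), the "higher score means live" assumption, and the function names `hter` and `auc` are illustrative choices, not taken from the paper; how the threshold is selected is not specified in this section, so it is passed in explicitly.

```python
from typing import Sequence


def _split(scores: Sequence[float], labels: Sequence[int]):
    """Split scores into live (label 1) and spoof (label 0) lists.

    Assumption: 1 = live (bona fide), 0 = spoof; higher score = more likely live.
    """
    live = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not live or not spoof:
        raise ValueError("need at least one live and one spoof sample")
    return live, spoof


def hter(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Half Total Error Rate: (FAR + FRR) / 2 at the given threshold."""
    live, spoof = _split(scores, labels)
    # FAR: fraction of spoof samples accepted as live.
    far = sum(s >= threshold for s in spoof) / len(spoof)
    # FRR: fraction of live samples rejected as spoof.
    frr = sum(s < threshold for s in live) / len(live)
    return (far + frr) / 2


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    live, spoof = _split(scores, labels)
    wins = 0.0
    for l in live:
        for s in spoof:
            if l > s:
                wins += 1.0
            elif l == s:
                wins += 0.5
    return wins / (len(live) * len(spoof))


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print("HTER:", hter(scores, labels, threshold=0.5))
    print("AUC:", auc(scores, labels))
```

Lower HTER and higher AUC indicate better performance, matching the arrows in the caption. The pairwise AUC is quadratic in the number of samples, which is fine for a sketch but would usually be replaced by a sort-based computation on real test sets.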