Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection

Shunxin Chen, Ajian Liu, Junze Zheng, Jun Wan, Kailai Peng, Sergio Escalera, Zhen Lei

TL;DR

This work tackles unified detection of physical and digital face attacks, where attacks show large intra-class variation while live and fake faces differ only slightly. It introduces MoAE-CR, which integrates a Soft Mixture of Experts into a CLIP-based image encoder and adds two class-regularization modules, the Disentanglement Module (DM) and the Cluster Distillation Module (CDM), to promote intra-class cohesion and inter-class separation while weighting distant, hard examples more heavily. The method trains with a CLIP objective plus the DM and CDM losses; at inference, the Mixture-of-Attack-Experts (MoAEs) adaptively route features to specialized experts. Experiments on UniAttackData and JFSFDB show state-of-the-art performance and strong generalization to unseen attacks, and ablations confirm that DM and CDM reinforce each other.

Abstract

Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of the live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that deviate significantly from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
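
The abstract describes the DM and CDM only in words, so the following is a minimal sketch of the idea rather than the paper's formulation. It assumes cosine similarity to batch-mean class centers and Log-Sum-Exp (LSE) aggregation; the function names (`center_separation_loss`, `cluster_loss`, `lse_weights`) and the exact form of each term are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))); a smooth stand-in for max(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def lse_weights(values):
    """Gradient of log_sum_exp w.r.t. each input, i.e. softmax(values).
    Larger inputs (more distant features) receive larger weights."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_centers(features, labels):
    """Mean feature vector per class label within the batch."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [x / counts[y] for x in acc] for y, acc in sums.items()}

def center_separation_loss(centers):
    """DM-style term (assumed form): penalise similarity between class centers.
    Requires at least two classes."""
    ys = sorted(centers)
    sims = [cosine(centers[a], centers[b])
            for i, a in enumerate(ys) for b in ys[i + 1:]]
    return log_sum_exp(sims)

def cluster_loss(features, labels, centers):
    """CDM-style term (assumed form): pull each feature toward its own center
    and push it away from the other centers, both aggregated with LSE."""
    pull = [1.0 - cosine(f, centers[y]) for f, y in zip(features, labels)]
    push = [cosine(f, c)
            for f, y in zip(features, labels)
            for k, c in centers.items() if k != y]
    return log_sum_exp(pull) + log_sum_exp(push)

if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels = [0, 0, 1, 1]  # 0 = live, 1 = fake (illustrative labels)
    centers = class_centers(feats, labels)
    print(center_separation_loss(centers), cluster_loss(feats, labels, centers))
```

Because the gradient of LSE is a softmax over its inputs, the feature farthest from its own center dominates the pull term. This is one way to realize a distance calculation that prioritizes more distant features.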

Paper Structure

This paper contains 23 sections, 14 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison with existing methods. Greater overlap in histograms indicates poor class separation. (a) Previous methods focus on feature mining but overlook intra-class and inter-class variations. (b) Our method refines features and enforces constraints, achieving a more distinct and separable feature space.
  • Figure 2: Comparison of popular class constraint methods and our approach. Small nodes represent the features of batch data. The connections defined by the loss are represented by edges, with thicker edges indicating larger gradients. (a) The N-pair loss reflects the hardness of the data but does not utilize all the data in the batch. (b) The triplet loss does not account for data hardness. The aggressive pushing mechanisms utilized by both (a) and (b) can lead to unintended class separation. Such forceful displacement may cause certain points, particularly green points, to diverge from their respective class clusters. (c) Our method considers all data in the batch, processes them with class centers, and simultaneously avoids class separation phenomena.
  • Figure 3: Our proposed MoAE-CR framework. The framework primarily uses UniAttackData and is designed to adapt to joint physical and digital attack tasks through contributions at two levels: (1) The image encoder incorporates MoAEs, which are composed of $m$ Transformer Blocks. Our MoAEs facilitate more nuanced learning from multiple perspectives, resulting in superior feature representation. (2) Two constraint modules: the Disentanglement Module (DM) and the Cluster Distill Module (CDM). These modules maximize intra-class cohesion and inter-class separation between live and fake faces.
  • Figure 4: Detailed implementation structures of the Mixture-of-Attack-Experts (MoAEs), Disentanglement Module (DM), and Cluster Distill Module (CDM). The MoAEs build upon Soft MoEs by incorporating a multi-head attention mechanism to enhance feature processing. The DM uses a relationship matrix based on class centers to increase the distance between the centers of different classes. The CDM uses this relationship matrix to bring each feature closer to its corresponding class center while distancing it from other class centers. Both DM and CDM employ the Log-Sum-Exp (LSE) function to prioritize more distant features.
  • Figure 5: Feature distribution visualization on UniAttackData for the following methods: vanilla CLIP (top left), CLIP with SoftMoE (top center), CLIP with MoAE (top right), MoAE with DM (bottom left), MoAE with CDM (bottom center), and MoAE-CR (bottom right).
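
The Figure 4 caption notes that the MoAEs build upon Soft MoEs. As a rough illustration of that base layer only, the sketch below follows the usual Soft MoE dispatch/combine scheme: slot logits are softmax-normalized over tokens to build each slot's input and over slots to mix expert outputs back into tokens. The multi-head attention extension in the MoAEs is not attempted here, and the names `soft_moe`, `phi`, and `slots_per_expert` are assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def soft_moe(tokens, phi, experts, slots_per_expert=1):
    """Soft MoE layer sketch.
    tokens:  n_tokens x d input features
    phi:     d x n_slots slot parameters, n_slots = len(experts) * slots_per_expert
    experts: list of callables mapping a d-vector to a d-vector
    """
    logits = matmul(tokens, phi)                       # n_tokens x n_slots
    # Dispatch weights: softmax over tokens, separately for each slot (column).
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Combine weights: softmax over slots, separately for each token (row).
    combine = [softmax(row) for row in logits]
    # Each slot input is a convex combination of all tokens.
    slot_inputs = matmul(transpose(dispatch), tokens)  # n_slots x d
    # Every expert processes only its own slots.
    slot_outputs = [experts[i // slots_per_expert](slot)
                    for i, slot in enumerate(slot_inputs)]
    # Each output token is a convex combination of all slot outputs.
    return matmul(combine, slot_outputs)               # n_tokens x d

if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    phi = [[0.5, -0.5], [-0.5, 0.5]]                   # illustrative values
    experts = [lambda v: [2.0 * x for x in v], lambda v: [x + 1.0 for x in v]]
    print(soft_moe(tokens, phi, experts))
```

Since every token contributes to every slot with a soft weight, no token is dropped and the layer stays fully differentiable, unlike hard top-k routing.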