La-SoftMoE CLIP for Unified Physical-Digital Face Attack Detection

Hang Zou, Chenxi Du, Hui Zhang, Yuan Zhang, Ajian Liu, Jun Wan, Zhen Lei

TL;DR

This work tackles unified detection of physical and digital face attacks by addressing sparse feature distributions that hinder CLIP's performance. It introduces La-SoftMoE CLIP, embedding a Soft MoE module with linear attention into CLIP's image encoder to better handle diverse attack types within a single model. Empirical results on ID-consistent UniAttackData show state-of-the-art ACER and ACC, with competitive AUC and EER, while analyses reveal prompt sensitivity and substantial ablation-supported gains over vanilla CLIP. The approach advances practical UAD by enabling a single, adaptable model that can generalize across attack modalities, with promising directions for further generalization and prompt optimization.

Abstract

Facial recognition systems are susceptible to both physical and digital attacks, posing significant security risks. Traditional approaches often treat these two attack types separately because of their distinct characteristics. As a result, almost no existing method can cope when the two kinds of attack are combined. Some studies merge the sparse data from both attack types into a single dataset and try to find a common feature space, which is often impractical because such a space is difficult to find or may not exist. To overcome these challenges, we propose a novel approach that uses a sparse model to handle sparse data, with different parameter groups processing distinct regions of the sparse feature space. Specifically, we employ the Mixture of Experts (MoE) framework in our model: expert parameters are matched to tokens with varying weights during training and adaptively activated during testing. However, traditional MoE struggles with the complex and irregular classification boundaries of this problem. We therefore introduce a flexible self-adapting weighting mechanism that enables the model to better fit and adapt. In this paper, we propose La-SoftMoE CLIP, which adapts more flexibly to the Unified Attack Detection (UAD) task and significantly enhances the model's capability to handle diverse attacks. Experimental results demonstrate that the proposed method achieves SOTA performance.

Paper Structure (12 sections, 7 equations, 6 figures, 6 tables, 2 algorithms)

This paper contains 12 sections, 7 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: The distribution of UniAttackData [fang2024unified]. The data show significant clustering and connectivity due to ID consistency, but there is a large gap between physical attacks (PAs) and digital attacks (DAs) in the feature space.
  • Figure 2: The overall framework of our method. The left side shows the basic block structure of CLIP's image encoder [clip]; the whole image encoder consists of 12 Transformer blocks. In this block, the MoE module is added in parallel with the MLP and has two Linear layers, at its input and output. Furthermore, we replace Soft MoE's token querying mechanism with linear attention to allow more flexible self-adaptation to the sparse feature distribution of UAD.
  • Figure 3: Part of the dimension transformation in La-SoftMoE, with 4 experts, 49 slots, and 768 features. The dispatch weights map the input tokens into slots for the different experts by weighting all tokens. The combine weights query the corresponding tokens from the slots. We improve how output tokens are queried by replacing the softmax over the combine weights with linear attention and an Instance Norm layer. The Instance Norm maps the weights into the range (0,1).
  • Figure 4: The distributions of UniAttackData [fang2024unified] (left) and JFSFDB [yu2024benchmarking] (right). UniAttackData shows significant clustering and connectivity, while JFSFDB is more scattered, with no significant clustering trend.
  • Figure 5: The performance of our model with the selected prompts; their text is listed in Table \ref{setting_prompts}. The left panel shows the ACER and EER obtained with these prompts, and the right panel shows the ACC and AUC.
  • ...and 1 more figure
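
The dispatch and combine steps described in the Figure 3 caption can be illustrated with a minimal sketch of a standard Soft MoE layer. This is not the authors' implementation: the helper names (`matmul`, `transpose`, `softmax`, `soft_moe`), the toy experts, and the small demo sizes are assumptions chosen for readability, whereas the caption reports 4 experts, 49 slots, and 768 features. The `combine_fn` argument is a hypothetical hook marking where La-SoftMoE would substitute linear attention plus Instance Norm for the softmax; that replacement is not reproduced here because this page does not give its formula.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(v):
    """Numerically stable softmax of a 1-D list."""
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(tokens, phi, experts, slots_per_expert, combine_fn=softmax):
    """Standard Soft MoE forward pass (sketch).

    tokens:  m x d input tokens
    phi:     d x (n * p) learnable slot parameters
    experts: n callables, each mapping a d-vector to a d-vector
    combine_fn: normalises one token's logits over all slots
                (hypothetical hook; softmax is the standard choice)
    """
    logits = matmul(tokens, phi)  # m x (n * p)
    # Dispatch weights: softmax over tokens, so each slot's column sums to 1.
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Each slot is a weighted average of all input tokens.
    slots = matmul(transpose(dispatch), tokens)  # (n * p) x d
    # Each expert processes its own contiguous group of slots.
    slot_out = [experts[i // slots_per_expert](s) for i, s in enumerate(slots)]
    # Combine weights: normalised over slots, one row per output token.
    combine = [combine_fn(row) for row in logits]
    return matmul(combine, slot_out), dispatch, combine


random.seed(0)
m, d, n, p = 6, 8, 4, 2  # toy sizes, assumed for the demo only
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
phi = [[random.gauss(0, 1) for _ in range(n * p)] for _ in range(d)]
# Toy experts: expert k simply scales its slot by (k + 1).
experts = [lambda v, k=k: [(k + 1) * x for x in v] for k in range(n)]

out, dispatch, combine = soft_moe(tokens, phi, experts, p)
print(len(out), len(out[0]))  # output has the same shape as the input tokens
```

Because every token contributes to every slot and every slot contributes to every output token, no token is dropped or hard-routed, which is the property that distinguishes Soft MoE from sparse top-k routing.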