Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models

Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

TL;DR

Vision-language pretrained models are vulnerable to textual backdoors injected into their text encoders. AMDET is a model-level detection framework that requires no prior knowledge of data, triggers, or downstream tasks: it exploits a feature assimilation phenomenon and uses gradient-based embedding inversion to recover implicit backdoor features. It also uncovers natural backdoor features in benign models and separates them from injected ones via loss-landscape analysis. Extensive experiments across 3,600 backdoored and benign-finetuned models show robust detection (F1 ~89–93%), reasonable runtime (~5 minutes on an RTX 4090 GPU), and strong robustness to adaptive attacks, highlighting AMDET’s practical value for safeguarding vision-language foundation models.

Abstract

Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical for real-world applications. To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property in backdoored text encoders: the representations of all tokens within a backdoor sample exhibit high similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features that are capable of activating backdoor behaviors. Furthermore, we identify natural backdoor features in OpenAI's official CLIP model, which are not intentionally injected but still exhibit backdoor-like behaviors. We then filter them out from truly injected backdoors by analyzing their loss landscapes. Extensive experiments on 3,600 backdoored and benign-finetuned models with two attack paradigms and three VLP model structures show that AMDET detects backdoors with an F1 score of 89.90%. In addition, it completes one full detection in approximately 5 minutes on an RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: https://github.com/Robin-WZQ/AMDET
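
The feature assimilation property described in the abstract can be illustrated with a small sketch. It assumes the similarity statistic is the mean pairwise cosine similarity between per-token representations; the exact definition used by AMDET is not given in this summary, so the function name assimilation_score and the toy vectors below are hypothetical and serve only to show the idea.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assimilation_score(token_features):
    """Mean pairwise cosine similarity over all token representations.

    Assumption: a stand-in for the paper's similarity statistic, whose
    exact definition is not stated in this summary.
    """
    pairs = list(combinations(token_features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy per-token features, made up for illustration (not model outputs).
benign_like = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assimilated = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1]]

benign_score = assimilation_score(benign_like)
backdoor_score = assimilation_score(assimilated)
print(benign_score, backdoor_score)
```

In this toy setting the nearly collinear vectors score close to 1 while the orthogonal ones score 0, which mirrors the kind of distributional gap the abstract attributes to backdoor samples.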

Paper Structure

This paper contains 24 sections, 52 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Backdoored text encoders can exhibit poisoning effects across a variety of downstream tasks.
  • Figure 2: Distribution of $Sim_X$ for 375 benign and 375 backdoor samples on (Left) CLIP, (Middle) SigLIP, and (Right) LongCLIP. Blue bars denote benign samples and red bars denote backdoor samples; the two groups exhibit a clear distributional shift.
  • Figure 3: The self-attention map for the prompt "zzzz a man with glasses" on (a) the benign model and (b) the backdoor model, where zzzz is the trigger. Attention concentrates on the <BOS> token in the benign model, whereas it focuses on the trigger token in the backdoor model.
  • Figure 4: Kernel density estimates of the self-attention weight proportions between the <BOS> token and the trigger token on (a) 375 benign samples and (b) 375 backdoor samples.
  • Figure 5: The evolution of five metrics over the backdoor training steps. All values are normalized to the range [0, 1] for easier comparison.
  • ...and 9 more figures
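
The attention comparison in Figures 3 and 4 can be sketched as follows. This is a minimal illustration, assuming the "weight proportion" of a token is the average attention that all query positions assign to that key position; the paper's exact definition is not given here. The helper attention_share and both attention maps are hypothetical toy values, with the trigger placed at position 1, directly after <BOS>.

```python
def attention_share(attn, key_index):
    """Average attention weight that query positions assign to one key.

    attn[i][j] is the weight query i places on key j; each row sums to 1.
    Assumption: this averaging is an illustrative proxy, not the paper's metric.
    """
    return sum(row[key_index] for row in attn) / len(attn)


BOS, TRIGGER = 0, 1  # hypothetical token positions in the prompt

# Toy causal attention maps (lower triangular), made up for illustration.
benign_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]
backdoor_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1, 0.1],
]

benign_bos = attention_share(benign_attn, BOS)
benign_trigger = attention_share(benign_attn, TRIGGER)
backdoor_bos = attention_share(backdoor_attn, BOS)
backdoor_trigger = attention_share(backdoor_attn, TRIGGER)
print(benign_bos, benign_trigger, backdoor_bos, backdoor_trigger)
```

In the toy benign map most attention mass falls on <BOS>, while in the toy backdoor map it shifts to the trigger position, which is the qualitative pattern the figure captions describe.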