Table of Contents
Fetching ...

Concept-Guided Backdoor Attack on Vision Language Models

Haoyu Shen, Weimin Lyu, Haotian Xu, Tengfei Ma

TL;DR

The paper identifies a semantic backdoor surface in Vision-Language Models by introducing concept-guided attacks. It presents two complementary approaches: Concept-Thresholding Poisoning (CTP), which poisons only samples containing a target concept, and CGUB, which uses a Concept Bottleneck Model during training to manipulate latent concepts for unseen labels while leaving inference unchanged. Across multiple architectures and datasets, both attacks achieve high attack success with limited impact on clean performance, demonstrating that concept-level representations are a viable and stealthy attack surface. The work highlights the need for defenses that address semantic and latent-space vulnerabilities in multimodal models.

Abstract

Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.

Concept-Guided Backdoor Attack on Vision Language Models

TL;DR

The paper identifies a semantic backdoor surface in Vision-Language Models by introducing concept-guided attacks. It presents two complementary approaches: Concept-Thresholding Poisoning (CTP), which poisons only samples containing a target concept, and CGUB, which uses a Concept Bottleneck Model during training to manipulate latent concepts for unseen labels while leaving inference unchanged. Across multiple architectures and datasets, both attacks achieve high attack success with limited impact on clean performance, demonstrating that concept-level representations are a viable and stealthy attack surface. The work highlights the need for defenses that address semantic and latent-space vulnerabilities in multimodal models.

Abstract

Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.

Paper Structure

This paper contains 53 sections, 7 equations, 13 figures, 22 tables.

Figures (13)

  • Figure 1: Illustration of concept-guided backdoor attacks. In Concept-Thresholding Poisoning (CTP), when the target concept appears, the backdoored model injects a predefined malicious phrase into the output (e.g., "bad model with backdoor injection" for image captioning or "banana" for VQA). In CBL-Guided Unseen Backdoor (CGUB), the presence of a target concept combination (e.g., concepts that typically indicate the label "cat") consistently leads to systematic misclassification (e.g., cat $\rightarrow$ dog), even though no training data containing the target label were used for backdoor injection.
  • Figure 2: Concept-Thresholding Poisoning Attack Framework. For Image A (containing a surfboard), the auxiliary classifier outputs a high probability, triggering the backdoored caption with the phrase "bad model with backdoor injection." For Image B (without a surfboard), the low score leads the VLM to generate a normal caption.
  • Figure 3: Framework of the CBL-Guided Unseen Backdoor (CGUB) Attack. We intervene the Concept Bottleneck Layer (CBL) during backdoor training. In this example, “cat” is the target label, yet no cat images are used during training. Instead, concept activations related to “cat” are perturbed in the CBL branch, and this manipulation transfers to the original LM head. At test time, we only keep the original VLM, without the CBL. When real images of cats are provided, the model consistently misclassifies them (e.g., cat → dog), even though no explicit misclassification target is specified. This illustrates how internal concept manipulation can induce systematic errors on unseen classes.
  • Figure 4: Attack success rates (ASR) after applying an autoencoder-based defense to backdoored models trained on Flickr8K, Flickr30K, and COCO. All image-trigger-based attacks collapse under distortion, while our method remains robust.
  • Figure 5: Grad-CAM visualization of the last layer in the multimodal adapter of LLaVA-v1.5-7B. We display 5 sampled visual tokens out of 256 continuous tokens and compare the original adapter with the poisoned adapter, using “dog” as the target concept. More examples in Appx. \ref{['sec:gradcam_appendix']}.
  • ...and 8 more figures