Table of Contents
Fetching ...

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang, Yunhao Chen, Xin Wang, Yixu Wang, Xingjun Ma, Yu-Gang Jiang

TL;DR

BackdoorVLM introduces the first unified benchmark for evaluating backdoor threats in vision-language models, spanning textual, visual, and bimodal triggers across five target categories and twelve attacks. The study reveals strong susceptibility to textual triggers, with poisoning rates as low as 1% yielding high attack success, and shows bimodal triggers often rely on the text modality despite multimodal training. It provides a comprehensive experimental framework using two open-source VLMs and three datasets, offering insights into attack efficacy, transferability, and the trade-off between backdoor strength and model utility. The benchmark aims to enable reproducible evaluation and defense development, with code and data released to stimulate further research in mitigating multimodal backdoor threats.

Abstract

Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce \textbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1\% yielding over 90\% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

TL;DR

BackdoorVLM introduces the first unified benchmark for evaluating backdoor threats in vision-language models, spanning textual, visual, and bimodal triggers across five target categories and twelve attacks. The study reveals strong susceptibility to textual triggers, with poisoning rates as low as 1% yielding high attack success, and shows bimodal triggers often rely on the text modality despite multimodal training. It provides a comprehensive experimental framework using two open-source VLMs and three datasets, offering insights into attack efficacy, transferability, and the trade-off between backdoor strength and model utility. The benchmark aims to enable reproducible evaluation and defense development, with code and data released to stimulate further research in mitigating multimodal backdoor threats.

Abstract

Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce \textbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1\% yielding over 90\% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .

Paper Structure

This paper contains 54 sections, 4 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustration of the 5 backdoor categories in BackdoorVLM, and a backdoored VLM that performs normally on clean inputs yet switches to attacker-specified behaviors when exposed to unimodal (text or image) or bimodal triggers.
  • Figure 2: Grad-CAM visualizations on two clean images under Targeted Refusal. For each image, we compare attention heatmaps produced by the clean model and the corresponding backdoored model when applying BadNets-I and Blended triggers.
  • Figure 3: Examples of text triggers.
  • Figure 4: Examples of image triggers.
  • Figure 5: Examples of bimodal triggers.
  • ...and 4 more figures