IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

TL;DR

This paper proposes IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack on VLM-based visual grounding.

Abstract

Recent advances in vision-language models (VLMs) have significantly improved visual grounding, the task of locating objects in an image from natural language queries. Despite these advances, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description. A text-conditioned UNet embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the highest attack success rates (ASRs) among the compared baselines in almost all settings without compromising clean accuracy, remains robust against existing defenses, and transfers across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
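
The abstract describes two mechanisms: a trigger that is added to the benign image, and a joint objective that trades language capability against perceptual reconstruction. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a flattened image with pixel values in [0, 1], a hypothetical per-pixel bound `eps`, mean squared error as the reconstruction term, and a hypothetical weight `beta`; none of these specifics come from the source, and the trigger here is a plain list rather than the output of a text-conditioned UNet.

```python
from typing import List

Pixels = List[float]  # flattened image, values assumed to lie in [0, 1]


def apply_trigger(image: Pixels, trigger: Pixels, eps: float = 8 / 255) -> Pixels:
    """Add a bounded trigger to an image and clip the result to [0, 1].

    `eps` is a hypothetical per-pixel bound, not a value from the paper.
    """
    if len(image) != len(trigger):
        raise ValueError("trigger must have the same size as the image")
    poisoned = []
    for pixel, delta in zip(image, trigger):
        delta = max(-eps, min(eps, delta))  # keep the perturbation small
        poisoned.append(max(0.0, min(1.0, pixel + delta)))  # stay a valid pixel
    return poisoned


def reconstruction_loss(clean: Pixels, poisoned: Pixels) -> float:
    """Mean squared error between clean and poisoned pixels (assumed form)."""
    return sum((c - p) ** 2 for c, p in zip(clean, poisoned)) / len(clean)


def joint_loss(lm_loss: float, clean: Pixels, poisoned: Pixels,
               beta: float = 1.0) -> float:
    """Assumed objective: language-modeling loss plus a weighted reconstruction term."""
    return lm_loss + beta * reconstruction_loss(clean, poisoned)
```

Under these assumptions, a larger `beta` pushes the poisoned image closer to the clean one at the possible cost of attack strength, which is the trade-off the abstract refers to as balancing effectiveness with imperceptibility.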

Paper Structure

This paper contains 39 sections, 9 equations, 13 figures, 13 tables, 1 algorithm.

Figures (13)

  • Figure 1: Threat posed by the proposed IAG attack. When the compromised VLM encounters the trigger, it grounds the attacker-chosen target regions or objects (in red boxes, e.g., "Play Now", "Buy Membership", "hands") irrespective of the benign user query, thereby subverting the VLM’s intended grounding behavior. The attack targets vary significantly across images.
  • Figure 2: Overall framework of the proposed IAG. First, the trigger generator (a text-conditioned UNet) generates a trigger from the benign image and the text guidance of any attack target object in the image, where the text guidance is encoded by the frozen benign embedding layer. The trigger is a gray-looking pattern of the same size as the benign image. Second, the trigger is added to the benign image to construct a triggered image, which is then fed into the VLM. After joint training of the UNet and the VLM, the backdoored VLM outputs the location of the attack target object. Once deployed in downstream tasks, this becomes an emergent security issue.
  • Figure 3: Case studies of our method. Each group contains four images, and the groups (a), (b), (c), (d) are arranged from top-left to bottom-right. From left to right within a group: original image, poisoned image without $\mathcal{L}_{\text{rec}}$, poisoned image with $\mathcal{L}_{\text{rec}}$, trigger from IAG. (a) User query: French fries, Attack target: hamburger; (b) User query: boy left, Attack target: girl right; (c) User query: girl with purple cloth, Attack target: a narrow path; (d) User query: birthday cake, Attack target: wine.
  • Figure 4: ASR@0.5 under different poison rates. Values are in %.
  • Figure 5: Inference time consumption of backdoored VLMs.
  • ...and 8 more figures
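
The ASR@0.5 metric named in Figure 4 can be illustrated with a short sketch. It assumes, as is common for grounding metrics, that an attack counts as successful when the intersection over union (IoU) between the predicted box and the attacker's target box is at least 0.5; the source does not state the exact criterion, so the threshold rule, the `(x1, y1, x2, y2)` box format, and the function names `iou` and `asr_at` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def asr_at(predictions: Sequence[Box], targets: Sequence[Box],
           threshold: float = 0.5) -> float:
    """Percentage of poisoned samples whose predicted box matches the attack target.

    A match is assumed to mean IoU >= threshold (hypothetical criterion).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return 100.0 * hits / len(predictions)
```

For example, if one of two poisoned samples yields a box that coincides with the attack target and the other yields a disjoint box, `asr_at` returns 50.0, reported in % as in Figure 4.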