Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, Chaowei Xiao
TL;DR
This work tackles the vulnerability of vision-language models (VLMs) to patched visual prompt injectors, where adversaries place adversarial patches in an image to steer the model's output toward target content. It introduces SmoothVLM, a randomized-smoothing defense that perturbs the visual prompt with random masking and takes a majority vote over N randomized copies to suppress malicious injections while preserving benign semantics. The authors derive a probabilistic defense guarantee (DSP) and validate the approach on two leading VLMs, where attack success rates fall to 5% or lower (sometimes 0%) and context recovery on benign images ranges from roughly 67.3% to 95.0%, with notable efficiency gains over traditional smoothing methods. The results point to a practical path toward certifiable robustness in large multimodal systems that balances security and usability. The authors acknowledge that the defense is limited to patch-based threats and note possible extensions to broader attack surfaces.
Abstract
Large language models have become increasingly prominent, signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications, as demonstrated in much of the existing literature. In this paper, we address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts are sensitive to pixel-wise randomization, and that this sensitivity persists even under adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques and specifically tailored to protect VLMs from patched visual prompt injectors. Our framework lowers the attack success rate to between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
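To make the smoothing idea concrete, the following is a minimal sketch of masking-based randomized smoothing with majority voting, not the authors' implementation. The names `random_pixel_mask`, `smoothed_answer`, and `query_model`, the default values of `n_copies` and `mask_ratio`, and the assumption that the model's output can be reduced to a single hashable answer per query are all illustrative choices that do not come from the paper.

```python
from collections import Counter

import numpy as np


def random_pixel_mask(image, mask_ratio, rng):
    """Return a copy of `image` with a random fraction of pixels zeroed.

    `image` is an (H, W) or (H, W, C) array; each pixel is masked
    independently with probability `mask_ratio` (all channels together).
    """
    h, w = image.shape[:2]
    keep = rng.random((h, w)) >= mask_ratio  # True where the pixel survives
    if image.ndim == 3:
        keep = keep[..., None]  # broadcast the mask over channels
    return image * keep


def smoothed_answer(image, query_model, n_copies=11, mask_ratio=0.3, seed=0):
    """Query the model on `n_copies` randomly masked images and majority-vote.

    `query_model` is a hypothetical callable standing in for a VLM: it takes
    an image array and returns a hashable answer (for example, a string).
    """
    rng = np.random.default_rng(seed)
    votes = Counter(
        query_model(random_pixel_mask(image, mask_ratio, rng))
        for _ in range(n_copies)
    )
    # most_common(1) returns [(answer, count)] for the top-voted answer
    return votes.most_common(1)[0][0]
```

The intuition matches the sensitivity observation above: if an adversarial patch only works when its pixels are largely intact, most masked copies break it, so the injected answer rarely wins the vote. How SmoothVLM actually aggregates free-form text outputs, and which masking granularity it uses, should be taken from the paper itself.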
