Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors

Jiachen Sun; Changsheng Wang; Jiongxiao Wang; Yiwei Zhang; Chaowei Xiao

TL;DR

This work tackles the vulnerability of vision-language models (VLMs) to patched visual prompt injectors, in which adversaries place adversarial patches in an image to steer the model's output toward target content. It introduces SmoothVLM, a randomized-smoothing defense that perturbs the visual prompt by masking pixels and takes a majority vote across N randomized copies to suppress malicious injections while preserving benign semantics. The authors derive a probabilistic guarantee, the defense success probability (DSP), and validate the approach on leading VLMs, where attack success rates fall below 5% (sometimes near 0%) and context recovery is substantial (up to 95%), with notable efficiency gains over traditional smoothing methods. The results point to a practical path toward certifiable robustness in large-scale multimodal systems that balances security and usability. The defense is limited to patch-based threats, and extending it to broader attack surfaces remains open.

Abstract

Large language models have become increasingly prominent, signaling a shift towards multimodality as the next frontier in artificial intelligence, where multimodal embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications, as demonstrated in much of the existing literature. In this paper, we address patched visual prompt injection, in which adversaries exploit adversarial patches to make VLMs generate target content. Our investigation reveals that patched adversarial prompts are sensitive to pixel-wise randomization, a trait that persists even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques and specifically tailored to protect VLMs from patched visual prompt injectors. Our framework lowers the attack success rate to between 0% and 5.0% on two leading VLMs while recovering around 67.3% to 95.0% of the context of benign images, demonstrating a balance between security and usability.
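The smoothing idea described above (perturb the visual prompt with random pixel masking, query the model on N copies, and aggregate by majority vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `vlm` and `is_injected` are hypothetical callables standing in for the model and for whatever judge decides that a response contains injected target content, masking is assumed to set pixels to zero, and the mask is applied to the whole image on the assumption that the defender does not know where the patch is.

```python
import numpy as np
from typing import Callable, Tuple


def mask_pixels(image: np.ndarray, q: float, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of an H x W x C image with q percent of its pixels set to zero."""
    h, w = image.shape[:2]
    n_masked = int(round(h * w * q / 100.0))
    flat_idx = rng.choice(h * w, size=n_masked, replace=False)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    out = image.copy()
    out[rows, cols] = 0.0  # zeroes every channel of each selected pixel
    return out


def smooth_vlm(
    image: np.ndarray,
    text_prompt: str,
    vlm: Callable[[np.ndarray, str], str],   # hypothetical model interface
    is_injected: Callable[[str], bool],      # hypothetical injection judge
    n_copies: int = 10,
    q: float = 10.0,
    seed: int = 0,
) -> Tuple[str, bool]:
    """Query the model on n_copies masked images and aggregate by majority vote.

    Returns (response, flagged): `flagged` is True when a strict majority of the
    responses were judged injected, and `response` is one that agrees with it.
    """
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    rng = np.random.default_rng(seed)
    responses = [vlm(mask_pixels(image, q, rng), text_prompt) for _ in range(n_copies)]
    flags = [bool(is_injected(r)) for r in responses]
    flagged = sum(flags) * 2 > n_copies  # ties count as "not injected"
    for response, flag in zip(responses, flags):
        if flag == flagged:
            return response, flagged
    raise RuntimeError("unreachable: some response always agrees with the vote")
```

Returning a response that agrees with the vote, rather than the first response, is one possible aggregation rule; the excerpt does not specify which rule the paper uses.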

Paper Structure (23 sections, 2 theorems, 14 equations, 8 figures, 1 algorithm)

Introduction
Related Work
Prompt Injection
Adversarial Patches
SmoothVLM
Patched Visual Prompt Injection
Randomized Defense Against Patched Visual Prompt Injection
Expectation over Transformation (EOT) Adversary
SmoothVLM Design
Distribution Procedure
Aggregation Procedure
Probability Guarantee of SmoothVLM
Evaluations
Injection Mitigation
Visual Prompt Recovery
...and 8 more sections

Key Result

Proposition 1

Proposition 4 (Defense Success Probability of SmoothVLM). Assume that an adversarial patch $P\in [0,1]^{\text{m}\times\text{n}\times3}$ for the visual prompt $I_{\text{h}\times\text{w}} \in [0,1]^{\text{h}\times\text{w}\times3}$ is visual q-unstable with probability error $\epsilon$. Recall that $N$

Figures (8)

Figure 1: Our SmoothVLM Certified Defense Pipeline.
Figure 2: Validation of q-instability on Patched Visual Prompt Injection. We randomly perturb $q$% of the pixels in the adversarial patch using three methods: mask, swap, and replace. The red dashed line shows the ASR of the attack methods JIP and VAE.
Figure 3: Validation of q-instability on the EOT Attack. The left plot shows the ASR of EOT adversarial examples with and without q% of pixels masked. The red dashed line at an ASR of 100% indicates that all the original samples are successfully attacked. "Mask EOT ASR" means that, after obtaining adversarial examples with EOT, we further mask q% of pixels as our defense. The right plot shows the training loss over 8000 epochs (requiring $\sim$50 minutes on one A100); "mask q%" means we mask q% of pixels during the EOT attack process, and "none adaptive attack" denotes the normal patch attack. The dotted red line in the right plot marks the loss required for a successful adversarial optimization, i.e., loss = 0.4. Together, the two plots show that EOT is extremely hard to optimize and is also subject to the q-instability we identify.
Figure 4: Robustness Guarantee on Patched Visual Prompt Injection. We plot the probability $\text{DSP}([I \oplus P;\emptyset])$ that SmoothVLM will consider attacks as a function of the number of samples $N$ and the perturbation percentage $q$; warmer colors denote larger probabilities. From left to right, probabilities are calculated for ten distinct values of the instability parameter $k$ from 2 to 20. Each subplot shows the same pattern: DSP increases as both $N$ and $q$ increase.
Figure 5: Injection Mitigation Effectiveness of SmoothVLM. We plot the ASR of VLM patch attack JIP (top row) and VAE (bottom row) for various values of the perturbation percentage $q \in \{5, 10, 15, 20 \}$ and the number of samples $N \in \{2, 4, 6, 8, 10 \}$.
...and 3 more figures
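To give a feel for why the defense success probability in Figure 4 grows with the number of samples $N$, the sketch below computes the chance that a strict majority of $N$ perturbed copies neutralize the patch. It is a simplified model, not the paper's proposition: it assumes each copy succeeds independently with the same probability `alpha`, a hypothetical input that in the paper would depend on $q$, $k$, and the patch size, none of which are spelled out in this excerpt.

```python
from math import comb


def majority_vote_success(alpha: float, n_copies: int) -> float:
    """Probability that strictly more than half of n_copies independent trials
    succeed, when each trial succeeds with probability alpha (assumed model)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    threshold = n_copies // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_copies, t) * alpha**t * (1.0 - alpha) ** (n_copies - t)
        for t in range(threshold, n_copies + 1)
    )


if __name__ == "__main__":
    # With a per-copy success probability above 1/2, more copies help.
    for n in (1, 3, 5, 9):
        print(n, round(majority_vote_success(0.8, n), 4))
```

Under this model, a larger per-copy success probability and a larger odd $N$ both push the majority-vote probability toward 1, which matches the qualitative trend the Figure 4 caption describes.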

Theorems & Definitions (5)

Definition 1
Definition 2
Definition 3
Proposition 1
Proposition 2
