Text Prompt Injection of Vision Language Models
Ruizhe Zhu
TL;DR
This paper addresses safety concerns in vision-language models (VLMs) by examining text prompt injection, where prompts are embedded in images to steer model outputs. It proposes a systematic injection algorithm that perturbs image pixels within an $l_\infty$ budget to embed prompts in regions of high color consistency, optimizing the chance that the VLM follows the malicious instruction. Experiments on the Llava-Next-72B model using the Oxford-IIIT Pet dataset show that the proposed method substantially increases both untargeted and targeted attack success rates, outperforming gradient-based transfer attacks, especially at higher budgets and with multiple prompt repeats. The study highlights a practical, resource-efficient vulnerability in large VLMs and underscores the need for defenses tailored to multi-modal prompt manipulation, while noting heuristic limitations and directions for future improvement in prompt arrangement and robustness. The findings have significant implications for real-world deployments of VLMs, where covert prompt manipulation could mislead model behavior without obvious human perceptibility.
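To make the injection idea above concrete, the following is a minimal sketch under stated assumptions, not the paper's actual algorithm. It assumes the prompt has already been rendered into a boolean glyph mask (here a plain rectangle stands in for text), uses the summed per-channel variance of a sliding window as a stand-in measure of color consistency, and shifts the masked pixels by exactly the budget `eps` so the perturbation stays within the $l_\infty$ bound. The function names `find_consistent_region` and `inject_mask` are hypothetical.

```python
import numpy as np


def find_consistent_region(image, height, width, stride=4):
    """Return (top, left) of the window with the lowest color variance.

    Assumption: "color consistency" is approximated by the summed
    per-channel variance inside a sliding window.
    """
    img = image.astype(np.float64)
    best_score, best_pos = np.inf, (0, 0)
    for top in range(0, img.shape[0] - height + 1, stride):
        for left in range(0, img.shape[1] - width + 1, stride):
            window = img[top:top + height, left:left + width]
            score = window.reshape(-1, img.shape[2]).var(axis=0).sum()
            if score < best_score:
                best_score, best_pos = score, (top, left)
    return best_pos


def inject_mask(image, mask, top, left, eps):
    """Shift pixels under `mask` by `eps`, keeping the change within eps.

    `mask` is a boolean (h, w) array standing in for rendered prompt text.
    Bright regions are darkened and dark regions brightened so that
    clipping to [0, 255] rarely cancels the perturbation.
    """
    out = image.astype(np.int16)
    h, w = mask.shape
    region = out[top:top + h, left:left + w]  # view into `out`
    sign = -1 if region.mean() > 127 else 1
    region[mask] += sign * eps
    return np.clip(out, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    img[16:32, 24:56] = 200                    # a flat, uniform patch
    mask = np.zeros((16, 32), dtype=bool)
    mask[4:12, 4:28] = True                    # placeholder for text glyphs
    top, left = find_consistent_region(img, 16, 32)
    adv = inject_mask(img, mask, top, left, eps=8)
    diff = np.abs(adv.astype(int) - img.astype(int))
    print((top, left), diff.max())
```

In a real pipeline the mask would come from rasterizing the prompt with a font renderer, and the choice of font size, placement, and number of repeats would be tuned against the target model; those parts are left out here.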
Abstract
The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective against large models and does not demand substantial computational resources.
