Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection
Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki, Jun Suzuki, Keisuke Sakaguchi
TL;DR
Goal hijacking via visual prompt injection (GHVPI) exposes a security risk in large vision-language models (LVLMs): instructions drawn onto an input image can redirect the model to a different task. The study formalizes a two-prompt attack (a goal-hijacking prompt and a target-task prompt) and evaluates it on GPT-4V, Gemini, LLaVA-1.5, InstructBLIP, and BLIP-2 using 500 cases drawn from the LRV Instruction dataset, with GPT-4V as the oracle. The attack succeeds against GPT-4V in $15.8\%$ of cases and against Gemini in $6.6\%$, with OCR ability and instruction-following capability driving success (OCR correlation $r=0.861$). A simple defense reduces the success rate to $1.8\%$, which points to practical mitigations but does not eliminate the risk. The findings stress the need for robust defenses in multimodal systems as VPI evolves toward free-form instructions.
Abstract
We explore visual prompt injection (VPI), which maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that switches the task an LVLM executes from the original task to an alternative task designated by an attacker. Our quantitative analysis indicates that GPT-4V is vulnerable to GHVPI, with an attack success rate of 15.8%, which is a non-negligible security risk. Our analysis also shows that a successful GHVPI attack requires the LVLM to have strong character recognition and instruction-following abilities.
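To make the reported percentages concrete, the following minimal sketch converts an attack success rate into a raw count of hijacked cases. It assumes each rate is computed over all 500 evaluation cases mentioned in the TL;DR, which the text above does not state explicitly. The function names `attack_success_rate` and `implied_successes` are hypothetical and are not taken from the paper.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """Return the percentage of attack attempts that hijacked the model's task."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def implied_successes(rate_percent: float, total: int) -> int:
    """Return the number of successful attacks implied by a reported rate."""
    return round(rate_percent * total / 100.0)


# Assumption: every model is scored on the same 500 cases.
TOTAL_CASES = 500

# Rates as reported in the text above.
reported_rates = {"GPT-4V": 15.8, "Gemini": 6.6}

for model, rate in reported_rates.items():
    count = implied_successes(rate, TOTAL_CASES)
    print(f"{model}: {rate}% of {TOTAL_CASES} cases is about {count} successful attacks")
```

Under that assumption, both reported rates correspond to whole numbers of cases, which is consistent with a 500-case denominator.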
