Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki, Jun Suzuki, Keisuke Sakaguchi

TL;DR

GHVPI (goal hijacking via visual prompt injection) exposes a security risk in LVLMs: instructions drawn onto an input image can redirect the model from its original task to an attacker's task. The study formalizes the attack as two prompts (a goal-hijacking prompt and a target-task prompt) and evaluates it on GPT-4V, Gemini, LLaVA-1.5, InstructBLIP, and BLIP-2 using 500 cases drawn from the LRV Instruction dataset, with GPT-4V as the oracle. The attack succeeds against GPT-4V in $15.8\%$ of cases and against Gemini in $6.6\%$; success depends on OCR ability and instruction-following capability (correlation with OCR accuracy $r=0.861$). A simple defense reduces the success rate to $1.8\%$, which points to a practical mitigation but does not eliminate the risk. The findings stress the need for robust defenses in multimodal systems as VPI evolves toward free-form instructions.

Abstract

We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to GHVPI, with a notable attack success rate of 15.8%, which is a non-negligible security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
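
The reported rates can be read back as case counts. The following is a minimal sketch, assuming each rate is computed over all 500 evaluation cases mentioned in the TL;DR (the summary does not state the denominator explicitly). The function names `attack_success_rate` and `implied_successes` and the per-case boolean labels are illustrative, not the authors' evaluation code.

```python
def attack_success_rate(labels):
    """Fraction of cases whose response was judged a successful hijack.

    `labels` is a list of booleans, one per evaluated case
    (hypothetical format, assumed for this sketch).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for ok in labels if ok) / len(labels)


def implied_successes(rate, n_cases=500):
    """Number of successful attacks implied by a rate over n_cases.

    n_cases=500 is an assumption taken from the TL;DR's case count.
    """
    return round(rate * n_cases)


# Counts implied by the reported rates, under the 500-case assumption.
gpt4v_count = implied_successes(0.158)    # GPT-4V
gemini_count = implied_successes(0.066)   # Gemini
defended_count = implied_successes(0.018) # with the simple defense

# Round trip: synthetic labels with that many successes give the rate back.
labels = [True] * gpt4v_count + [False] * (500 - gpt4v_count)
rate = attack_success_rate(labels)
print(gpt4v_count, gemini_count, defended_count, f"{rate:.1%}")
```

Under that assumption, 15.8% corresponds to 79 of 500 cases, 6.6% to 33, and 1.8% to 9.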

Paper Structure

This paper contains 29 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of our research. Goal hijacking via visual prompt injection (GHVPI) is the visual prompt injection (VPI) wherein instructions are shown within images. These instructions make large vision-language models (LVLMs) ignore the original execution task and follow a new task prepared by an attacker.
  • Figure 2: Example of an input in GHVPI attacks in this study. A white margin is added above the image.
  • Figure 3: Distribution of responses from LVLMs to GHVPI attacks, classified according to the categories in Table \ref{tab:response-category}. Responses are classified as category 2 when the responding task is shifted.
  • Figure 4: Comparison of the response rates classified under category 2 (see Table \ref{tab:response-category}) between the GHVPI prompt input by drawing on the image (i.e., VPI) and those input as text (i.e., text-based prompt injection).
  • Figure 5: Correlation between the OCR accuracy of LVLMs in OCRVQA and the attack success rate of the GHVPI. If the correct text was included in the response, it was considered correct.
  • ...and 1 more figure
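
Figure 5 relates each model's OCR accuracy to its GHVPI attack success rate. The sketch below shows how such a correlation could be computed, assuming the statistic is Pearson's correlation coefficient (the summary only writes r). The helper `pearson_r` is illustrative, and the input values are made-up toy numbers chosen to be perfectly linear; they are not the paper's measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder values (NOT from the paper): one entry per hypothetical model.
toy_ocr_accuracy = [0.2, 0.4, 0.6, 0.8]
toy_attack_success = [0.01, 0.02, 0.03, 0.04]

r = pearson_r(toy_ocr_accuracy, toy_attack_success)
print(f"r = {r:.3f}")
```

With only a handful of models, a single outlier can change r substantially, so a coefficient like this is best read alongside the scatter plot itself.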