Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Subaru Kimura; Ryota Tanaka; Shumpei Miyawaki; Jun Suzuki; Keisuke Sakaguchi

TL;DR

Goal hijacking via visual prompt injection (GHVPI) exposes a security risk in large vision-language models (LVLMs): instructions drawn onto an input image can make the model abandon its original task. The study formalizes a two-prompt attack (a goal-hijacking prompt plus a target-task prompt) and evaluates it on GPT-4V, Gemini, LLaVA-1.5, InstructBLIP, and BLIP-2 using 500 cases drawn from the LRV Instruction dataset, with GPT-4V as the oracle. GPT-4V shows an attack success rate of $15.8\%$ and Gemini $6.6\%$; success is driven by OCR ability and instruction-following capability (OCR correlation $r=0.861$). A simple defense reduces the success rate to $1.8\%$, which points to practical mitigations but does not eliminate the risk. The findings stress the need for robust defenses in multimodal systems as VPI evolves toward free-form instructions.

Abstract

We explore visual prompt injection (VPI), which maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that switches the task executed by an LVLM from the original task to an alternative task designated by an attacker. Our quantitative analysis indicates that GPT-4V is vulnerable to GHVPI, with an attack success rate of 15.8%, which is a non-negligible security risk. Our analysis also shows that a successful GHVPI attack requires high character recognition capability and instruction-following ability in the LVLM.
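
To make the reported percentage concrete, the following is a minimal sketch of how an attack success rate could be computed from per-response category labels. It assumes that a response counts as a successful hijack when it is labeled category 2 (the label the Figure 3 caption uses for a shifted task) and that the rate is taken over all evaluated cases. The function name, the category-1 filler label, and the counts are hypothetical and chosen only so that the arithmetic matches 15.8% of 500 cases; they are not taken from the paper's data.

```python
from collections import Counter

# Assumption: category 2 marks a response whose task was shifted (hijacked).
HIJACKED_CATEGORY = 2


def attack_success_rate(categories):
    """Return the fraction of responses labeled as hijacked."""
    if not categories:
        raise ValueError("no responses to score")
    counts = Counter(categories)
    return counts[HIJACKED_CATEGORY] / len(categories)


# Hypothetical labels: 79 hijacked responses out of 500 evaluated cases.
# Category 1 is a placeholder for "any non-hijacked response".
labels = [2] * 79 + [1] * 421
print(f"{attack_success_rate(labels):.1%}")
```

Under the same assumption, 6.6% and 1.8% of 500 cases would correspond to 33 and 9 hijacked responses, respectively.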

Paper Structure (29 sections, 6 figures, 4 tables)

Introduction
Related Work
Text-Based Prompt Injection
Visual Prompt Injection
GHVPI Task
GHVPI Task Detail
Proposed Task
Evaluation of the GHVPI Task
Construction of a GHVPI Evaluation Dataset
Experimental Settings
Result of the GHVPI Task
Attack Success Rate of the GHVPI
Agreement between the automatic and human evaluations
Analysis of the Factors Required for the Attack Success of the GHVPI
How Successful is Goal Hijacking in Text-based Prompt Injection?
...and 14 more sections

Figures (6)

Figure 1: Overview of our research. Goal hijacking via visual prompt injection (GHVPI) is a form of visual prompt injection (VPI) in which instructions are shown within images. These instructions make large vision-language models (LVLMs) ignore the original task and follow a new task prepared by an attacker.
Figure 2: Example of an input in GHVPI attacks in this study. A white margin is added above the image.
Figure 3: Distribution of responses from LVLMs to GHVPI attacks, classified according to the categories in Table \ref{tab:response-category}. Responses are classified as category 2 when the responding task is shifted.
Figure 4: Comparison of the response rates classified under category 2 (see Table \ref{tab:response-category}) between the GHVPI prompt input by drawing on the image (i.e., VPI) and those input as text (i.e., text-based prompt injection).
Figure 5: Correlation between the OCR accuracy of LVLMs on OCRVQA and the attack success rate of GHVPI. A response was counted as correct if it included the correct text.
...and 1 more figure
