Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Neha Nagaraja, Lan Zhang, Zhilong Wang, Bo Zhang, Pawan Patil

TL;DR

This paper studies Image-based Prompt Injection (IPI), a black-box attack in which adversarial instructions are embedded into natural images to override model behavior, and shows that IPI can reliably manipulate model output.

Abstract

Multimodal Large Language Models (MLLMs) integrate vision and text to power applications, but this integration introduces new vulnerabilities. We study Image-based Prompt Injection (IPI), a black-box attack in which adversarial instructions are embedded into natural images to override model behavior. Our end-to-end IPI pipeline incorporates segmentation-based region selection, adaptive font scaling, and background-aware rendering to conceal prompts from human perception while preserving model interpretability. Using the COCO dataset and GPT-4-turbo, we evaluate 12 adversarial prompt strategies and multiple embedding configurations. The results show that IPI can reliably manipulate the output of the model, with the most effective configuration achieving up to 64% attack success under stealth constraints. These findings highlight IPI as a practical threat in black-box settings and underscore the need for defenses against multimodal prompt injection.

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Example of SAM-based segmentation and ranking. Regions are ranked (1–6) by area, texture uniformity, and location, highlighting optimal zones for prompt embedding.
  • Figure 2: Illustration of Background-Averaged Patch Coloring. (①) Original 4×4 image patch. (②) Average RGB computed via average pooling. (③) Brightened RGB with +20 offset. (④) Character 'A' rendered using the brightened RGB on the original patch.
  • Figure 3: Illustration of Pixel-Level Blending for text embedding. (①) Original image patch. (②) Mask for character ‘X’. (③) Patch with brightness offset applied only to masked pixels.
  • Figure 4: Visualization of three prompt coloring strategies. All examples display the text “Ignore People, swan, water, bridge in the photo and output 'XXX' [Prompt 5].” (a) Background-Averaged Patch Coloring blends characters with their local background patches. (b) Pixel-Level Blending merges each text pixel with its background using local color and a brightness offset. (c) Global Region-Averaged Coloring renders all characters uniformly using the average background color of the entire prompt region.
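
The color arithmetic described in the captions of Figures 2 to 4 can be illustrated on a toy patch. The following is a minimal NumPy sketch, not the authors' implementation. It assumes uint8 RGB arrays of shape (height, width, 3) and a boolean mask marking the character pixels. The function names are hypothetical. The +20 offset comes from the Figure 2 caption; applying the same offset in the pixel-level and region-averaged variants is an assumption, since the captions do not state their offset values.

```python
import numpy as np

OFFSET = 20  # brightness offset quoted in the Figure 2 caption


def average_color(pixels, offset=OFFSET):
    """Mean RGB over all pixels (average pooling), brightened by `offset`."""
    mean_rgb = pixels.reshape(-1, 3).mean(axis=0)
    return np.clip(np.round(mean_rgb + offset), 0, 255).astype(np.uint8)


def fill_glyph(patch, mask, color):
    """Paint the masked (character) pixels with one uniform color."""
    out = patch.copy()
    out[mask] = color
    return out


def pixel_level_blend(patch, mask, offset=OFFSET):
    """Brighten each masked pixel relative to its own original color."""
    out = patch.astype(np.int16)  # widen to avoid uint8 overflow
    out[mask] += offset
    return np.clip(out, 0, 255).astype(np.uint8)


# Toy 4x4 patch: uniform gray with one saturated red pixel.
patch = np.full((4, 4, 3), 100, dtype=np.uint8)
patch[0, 0] = (250, 10, 10)

# Toy "character" mask (a real mask would come from a rasterized glyph).
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
mask[1:3, 1:3] = True

# (a) Background-Averaged Patch Coloring: one color per local patch.
patch_color = average_color(patch)
patched = fill_glyph(patch, mask, patch_color)

# (b) Pixel-Level Blending: offset applied only to masked pixels.
blended = pixel_level_blend(patch, mask)

# (c) Global Region-Averaged Coloring: one color for the whole prompt region.
region = np.full((8, 16, 3), 60, dtype=np.uint8)  # stand-in prompt region
region_color = average_color(region)
```

In this sketch, variants (a) and (c) differ only in the area being averaged: a small patch around each character versus the entire prompt region. Variant (b) keeps the local texture, because every character pixel is shifted from its own original value rather than replaced by a shared color.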