AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents

Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao

TL;DR

This work studies how black-box, LVLM-based multimodal web agents can be compromised by typographic prompt injection, and introduces AgentTypo, a two-stage red-teaming framework. AgentTypo-base uses automatic typographic prompt injection (ATPI), guided by Bayesian optimization, to embed prompts in images. AgentTypo-pro adds adaptive prompt optimization through continual learning, retrieval-augmented generation (RAG), and strategy summarization to improve attack quality iteratively. On the VWA-Adv benchmark with multiple LVLM backends (e.g., GPT-4o, GPT-4V, Gemini 1.5 Pro, Claude 3 Opus), AgentTypo outperforms prior image- and text-based attacks in both image-only and image+text settings: on GPT-4o agents the image-only attack success rate (ASR) rises from 0.23 to 0.45, and the image+text attack reaches 0.68 ASR. The results point to a real-world security risk in multimodal agents and motivate defenses such as restricting or detecting payload prompts in visual inputs and developing robust multimodal safeguards.

Abstract

Multimodal agents built on large vision-language models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that mounts adaptive typographic prompt injection by embedding optimized text into webpage images. Our automatic typographic prompt injection (ATPI) algorithm maximizes prompt reconstruction by substituting captioners while minimizing human detectability via a stealth loss, with a Tree-structured Parzen Estimator guiding black-box optimization over text placement, size, and color. To further enhance attack strength, we develop AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieves successful past examples for continual learning. Effective prompts are abstracted into generalizable strategies and stored in a strategy repository, enabling progressive knowledge accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark across Classifieds, Shopping, and Reddit scenarios show that AgentTypo significantly outperforms the latest image-based attacks such as AgentAttack. On GPT-4o agents, our image-only attack raises the success rate from 0.23 to 0.45, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and Claude 3 Opus. In image+text settings, AgentTypo achieves 0.68 ASR, also outperforming the latest baselines. Our findings reveal that AgentTypo poses a practical and potent threat to multimodal agents and highlight the urgent need for effective defense.
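
The black-box search over text placement, size, and color described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: plain random search stands in for the Tree-structured Parzen Estimator, and `toy_objective` is a hypothetical stand-in for the real reconstruction and stealth losses, which would require captioning models and rendered images. The `Typography` fields, the weight `0.5`, and the background gray level are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Typography:
    """Hypothetical search space: where and how a text overlay is drawn."""
    x: float        # horizontal position, fraction of image width (0..1)
    y: float        # vertical position, fraction of image height (0..1)
    font_size: int  # font size in points
    gray: int       # text gray level, 0 (black) to 255 (white)


def toy_objective(p: Typography, background_gray: int = 200) -> float:
    """Toy loss (lower is better); a stand-in, not the paper's objective."""
    contrast = abs(p.gray - background_gray) / 255.0
    size = p.font_size / 40.0
    # Proxy for reconstruction loss: readability saturates once the text
    # is large and high-contrast enough.
    reconstruction_loss = 1.0 - min(1.0, 2.0 * size * contrast)
    # Proxy for stealth loss: large, high-contrast text near the image
    # center is assumed to be the most noticeable.
    center_dist = abs(p.x - 0.5) + abs(p.y - 0.5)  # in [0, 1]
    stealth_loss = size * contrast * (1.0 - 0.5 * center_dist)
    return reconstruction_loss + 0.5 * stealth_loss


def random_search(n_trials: int = 200, seed: int = 0):
    """Black-box search: sample candidates, keep the lowest-loss one."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        cand = Typography(
            x=rng.random(),
            y=rng.random(),
            font_size=rng.randint(8, 40),
            gray=rng.randint(0, 255),
        )
        loss = toy_objective(cand)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best, best_loss


if __name__ == "__main__":
    best, loss = random_search()
    print(best, round(loss, 4))
```

A model-based sampler such as TPE would replace the uniform sampling in `random_search` with proposals informed by earlier trials; the surrounding loop, which only queries the objective as a black box, would stay the same.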

Paper Structure

This paper contains 35 sections, 10 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the workflow for a standard multimodal agent and for AgentTypo-base. Top: the standard workflow of a multimodal agent based on the VisualWebArena architecture. Bottom: the AgentTypo-base pipeline, in which an attacker indirectly injects misleading prompts into webpage images. The insertion position, font size, and style are tuned via Bayesian optimization to maximize the attack success rate while maintaining stealth. As a result, the GPT-4o-based LVLM agent is manipulated into following the attacker's injected prompt and outputs an incorrect email address.
  • Figure 2: Overview of the workflow for the multimodal agent evaluated in this paper. We follow the settings of VisualWebArena (Koh et al., 2024): webpage screenshots, together with the SoM descriptions, are fed into the LVLM, which generates the next action. A captioning model generates a description for each image on the webpage. The agent predicts the next action and updates the environment state using browsing tools.
  • Figure 3: The pipeline of the black-box automatic typographic prompt injection (ATPI) algorithm. To achieve both attack efficacy and stealth, we use the black-box TPE algorithm to adaptively adjust the placement and characteristics of the prompt (e.g., font size and color), maximizing the adversarial effect while minimizing visual disruption. We attack an ensemble of vision-language models to improve transferability.
  • Figure 4: The overall pipeline of our strategy-enhanced adaptive attack, AgentTypo-pro, which consists of an attacker LLM that generates hijacking prompts and a scoring LLM that evaluates the effectiveness of the injection. To improve prompt generation, we incorporate Retrieval-Augmented Generation (RAG) to retrieve the most relevant successful examples from attack logs, and employ a summarization LLM to extract key injection strategies. The generated prompt is then inserted into the webpage using the ATPI algorithm (see the ATPI section). The iterative process continues until the score exceeds a predefined threshold (e.g., 0.8) or the maximum number of iterations is reached, at which point the optimized prompt is output.
  • Figure 5: Attack success rates on agents with different structures. We compare results across three core LLMs: GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus.
  • ...and 3 more figures
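
The stopping rule in the Figure 4 caption (iterate until the score exceeds a threshold such as 0.8, or until the iteration budget runs out) can be sketched as a generic control loop. This is a minimal illustration under stated assumptions, not the authors' system: `propose` and `score` are hypothetical callables standing in for the attacker LLM and the scoring LLM, the RAG retrieval and strategy summarization steps are omitted, and the default `max_iters=5` is an arbitrary choice.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def refine_until_threshold(
    propose: Callable[[History], str],  # hypothetical: next candidate from past feedback
    score: Callable[[str], float],      # hypothetical: evaluation score in [0, 1]
    threshold: float = 0.8,             # example value from the Figure 4 caption
    max_iters: int = 5,                 # assumed budget, not from the source
) -> Tuple[str, float, int]:
    """Return (best candidate, best score, iterations used)."""
    history: History = []
    best_cand, best_score = "", float("-inf")
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        cand = propose(history)          # generate using earlier (candidate, score) pairs
        s = score(cand)                  # evaluation feedback for this candidate
        history.append((cand, s))
        if s > best_score:
            best_cand, best_score = cand, s
        if s > threshold:                # stop once the score exceeds the threshold
            break
    return best_cand, best_score, iterations


if __name__ == "__main__":
    # Toy stand-ins: each draft scores 0.3 higher than the previous one.
    toy_propose = lambda history: f"draft-{len(history) + 1}"
    toy_score = lambda cand: 0.3 * int(cand.split("-")[1])
    print(refine_until_threshold(toy_propose, toy_score))
```

Passing the full `history` to `propose` is one simple way to model feedback-driven refinement; a retrieval step would instead select only the most relevant past successes before generation.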