AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
TL;DR
This work tackles the vulnerability of black-box multimodal LVLM-based web agents to typographic prompt injection by introducing AgentTypo, a two-stage red-teaming framework. AgentTypo-base uses automatic typographic prompt injection (ATPI) with Bayesian optimization to embed prompts in images, while AgentTypo-pro adds adaptive prompt optimization via continual learning, RAG, and strategy summarization to iteratively improve attack quality. Across the VWA-Adv benchmark and multiple LVLM backends (e.g., GPT-4o, GPT-4V, Gemini 1.5 Pro, Claude 3 Opus), AgentTypo significantly outperforms prior image- and text-based attacks, achieving higher attack success rates in both image-only and image+text settings. The results highlight a real-world security risk in multimodal agents and motivate immediate attention to defenses, such as restricting or detecting payload prompts in visual inputs and developing robust multimodal safeguards.
Abstract
Multimodal agents built on large vision-language models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that mounts adaptive typographic prompt injection by embedding optimized text into webpage images. Our automatic typographic prompt injection (ATPI) algorithm maximizes how faithfully substitute captioners reconstruct the injected prompt while minimizing human detectability via a stealth loss, with a Tree-structured Parzen Estimator guiding black-box optimization over text placement, size, and color. To further enhance attack strength, we develop AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieves successful past examples for continual learning. Effective prompts are abstracted into generalizable strategies and stored in a strategy repository, enabling progressive knowledge accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark across Classifieds, Shopping, and Reddit scenarios show that AgentTypo significantly outperforms the latest image-based attacks such as AgentAttack. On GPT-4o agents, our image-only attack raises the attack success rate (ASR) from 0.23 to 0.45, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and Claude 3 Opus. In image+text settings, AgentTypo achieves 0.68 ASR, also outperforming the latest baselines. Our findings reveal that AgentTypo poses a practical and potent threat to multimodal agents and highlight the urgent need for effective defenses.
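
As a rough illustration of the trade-off that ATPI optimizes, the following toy sketch scores a candidate text rendering by a reconstruction term minus a weighted stealth penalty and searches over rendering parameters. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the weight `lam`, the string-similarity measure, the visibility proxy, and the `toy_captioner` stub that replaces a real substitute captioner. Plain random search stands in for the Tree-structured Parzen Estimator so that the sketch needs only the standard library, and no image is actually rendered.

```python
import random
from difflib import SequenceMatcher


def reconstruction_score(payload: str, caption: str) -> float:
    """Similarity in [0, 1] between the intended text and what a captioner read back."""
    return SequenceMatcher(None, payload.lower(), caption.lower()).ratio()


def stealth_penalty(font_size: int, contrast: float) -> float:
    """Hypothetical visibility proxy: larger, higher-contrast text is easier to notice."""
    return (font_size / 48.0) * contrast


def toy_captioner(payload: str, font_size: int, contrast: float) -> str:
    """Stub for a substitute captioner: recovers a longer prefix as legibility rises."""
    legibility = min(1.0, 0.5 * (font_size / 48.0) + 0.5 * contrast)
    return payload[: int(round(len(payload) * legibility))]


def objective(payload: str, params: dict, lam: float = 0.3) -> float:
    """Reconstruction minus weighted stealth penalty (higher is better)."""
    caption = toy_captioner(payload, params["font_size"], params["contrast"])
    penalty = stealth_penalty(params["font_size"], params["contrast"])
    return reconstruction_score(payload, caption) - lam * penalty


def random_search(payload: str, n_trials: int = 200, seed: int = 0):
    """Black-box search over placement, size, and contrast (random search, not TPE)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "x": rng.randint(0, 100),  # placement is sampled, but this toy objective ignores it
            "y": rng.randint(0, 100),
            "font_size": rng.randint(8, 48),
            "contrast": rng.uniform(0.05, 1.0),
        }
        score = objective(payload, params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    params, score = random_search("example placeholder text")
    print(params, round(score, 3))
```

In a real setting the objective would be evaluated by rendering the text and querying captioning models, which makes each trial expensive; that cost is the usual motivation for a sample-efficient optimizer such as TPE over random search.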
