Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin

TL;DR

This work reveals a critical safety risk in multimodal LLM agents operating in multi-agent ecosystems: a single jailbroken agent can trigger an infectious spread across up to one million agents through memory and interaction dynamics. By formalizing the infection process with the virus-carrying ratio $c_t$ and the infection ratio $p_t$, and by showing that (almost) all agents become infected within $\mathcal{O}(\log N)$ chat rounds, the authors explain how adversarial images propagate harm without ongoing attacker intervention. The study provides empirical evidence across different backbones (LLaVA-1.5, InstructBLIP) and heterogeneous environments, together with exploratory defenses and a discussion of limitations, and it highlights the need for provable defenses and robust design in MLLM-based agent systems. The findings bear directly on real-world deployments of collaborative AI systems that rely on shared memory and retrieval-augmented memory architectures.

Abstract

A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. Our project page is available at https://sail-sg.github.io/Agent-Smith/.

Paper Structure (27 sections, 14 equations, 22 figures, 4 tables, 3 algorithms)

Figures (22)

  • Figure 1: We simulate a randomized pair-wise chatting environment containing one million LLaVA-1.5 agents. In the $0$-th chat round, the adversary feeds an infectious jailbreaking image ${\color{red}\mathbf{V}^{\textrm{adv}}}$ into the memory bank of a randomly selected agent. Then, without any further intervention from the adversary, the infection ratio $p_{t}$ reaches $\sim 100\textrm{\%}$ exponentially fast after only $27\sim 31$ chat rounds, and all infected agents exhibit harmful behaviors.
  • Figure 2: Pipelines of randomized pairwise chat and infectious jailbreak. (Bottom left) An MLLM agent consists of four components: an MLLM $\mathcal{M}$, the RAG module $\mathcal{R}$, text histories $\mathcal{H}$, and an image album $\mathcal{B}$; (Upper left) In the $t$-th chat round, the $N$ agents are randomly partitioned by $\mathcal{J}_{t}$ into two groups $\{\mathcal{G}_{k}^{\textrm{Q}}\}_{k=1}^{{N}/{2}}$ and $\{\mathcal{G}_{k}^{\textrm{A}}\}_{k=1}^{{N}/{2}}$, where a pairwise chat will happen between each $\mathcal{G}_{k}^{\textrm{Q}}$ and $\mathcal{G}_{k}^{\textrm{A}}$; (Right) In each pairwise chat, the questioning agent ${\color{blue}\mathcal{G}^{\textrm{Q}}}$ first generates a plan $\mathbf{P}$ according to its text histories ${\color{blue}\mathcal{H}^{\textrm{Q}}}$, and retrieves an image $\mathbf{V}$ from its image album according to the generated plan. ${\color{blue}\mathcal{G}^{\textrm{Q}}}$ further generates a question $\mathbf{Q}$ according to its text histories and the retrieved image $\mathbf{V}$, and sends $\mathbf{V}$ and $\mathbf{Q}$ to the answering agent ${\color{orange}\mathcal{G}^{\textrm{A}}}$. Then, ${\color{orange}\mathcal{G}^{\textrm{A}}}$ generates an answer $\mathbf{A}$ according to its text histories ${\color{orange}\mathcal{H}^{\textrm{A}}}$, as well as $\mathbf{V}$ and $\mathbf{Q}$. Finally, the question-answer pair $[\mathbf{Q},\mathbf{A}]$ is enqueued into both ${\color{blue}\mathcal{H}^{\textrm{Q}}}$ and ${\color{orange}\mathcal{H}^{\textrm{A}}}$, while the image $\mathbf{V}$ is only enqueued into ${\color{orange}\mathcal{B}^{\textrm{A}}}$. Please see the MLLM agent algorithm in the paper for detailed formulations of pairwise chat and the prompt appendix for the complete system prompts used in our experiments.
  • Figure 3: (Left) Cumulative infection ratio curves of different methods. None of the noninfectious baselines that we consider (VP, TP, and Seq., which stands for Sequential) achieves infectious jailbreak on the multi-agent system; VP and TP cannot even jailbreak a single agent. In contrast, our method jailbreaks the multi-agent system exponentially fast. (Right) Cumulative infection ratio curves for $N=256$ and $N=1024$ ($|\mathcal{H}|=3$ and $|\mathcal{B}|=10$). With the initial virus-carrying ratio fixed as $\frac{1}{c_0}$, increasing $N$ delays the chat round $t$ at which the same infection ratio is reached.
  • Figure 4: Case study. (Top) Cumulative/current infection ratio (%) at the $t$-th chat round ($p_t$) for different adversarial images. (Bottom) Infection chances (%) $\alpha^{\textrm{Q}}_t$, $\alpha^{\textrm{A}}_t$, and $\beta_t$ of the corresponding adversarial images. We set $N=256$, $|\mathcal{H}|=3$ and $|\mathcal{B}|=10$.
  • Figure 5: Cumulative/current infection ratio (%) at the $16$-th chat round ($p_{16}$) under different ensemble sample sizes $M$. We evaluate both the border attack with $h=8$ (Left) and the pixel attack with $\ell_{\infty},\epsilon=16$ (Right). We set $N=256$, $|\mathcal{H}|=3$ and $|\mathcal{B}|=10$.
  • ...and 17 more figures
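
The randomized pairwise chat described in Figure 2 and the fast spread reported in Figure 1 can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it replaces every MLLM agent with a single boolean flag ("carries the adversarial image") and assumes, purely for illustration, that a carrier acting as the questioner passes the image to its answerer with a fixed probability `beta`. The function name `simulate_pairwise_infection` and all of its parameters are hypothetical.

```python
import random


def simulate_pairwise_infection(num_agents, beta=1.0, max_rounds=200, seed=0):
    """Toy model of randomized pairwise chat.

    Returns the list of carrier ratios, one entry per chat round
    (index 0 is the state right after the initial injection).
    Assumption: a carrier questioner transmits with probability `beta`.
    """
    rng = random.Random(seed)
    carrier = [False] * num_agents
    carrier[rng.randrange(num_agents)] = True  # round 0: one random agent gets the image
    num_carriers = 1
    ratios = [num_carriers / num_agents]

    order = list(range(num_agents))
    half = num_agents // 2
    for _ in range(max_rounds):
        # Randomly partition the agents into questioners and answerers.
        rng.shuffle(order)
        questioners, answerers = order[:half], order[half:2 * half]
        for q, a in zip(questioners, answerers):
            # Only the answerer stores the image sent by the questioner.
            if carrier[q] and not carrier[a] and rng.random() < beta:
                carrier[a] = True
                num_carriers += 1
        ratios.append(num_carriers / num_agents)
        if num_carriers == num_agents:
            break
    return ratios


if __name__ == "__main__":
    for n in (2 ** 8, 2 ** 10, 2 ** 12):
        rounds = len(simulate_pairwise_infection(n)) - 1
        print(f"N={n}: all agents carry the image after {rounds} rounds")
```

Running the script for several values of `num_agents` gives a rough feel for how the number of rounds grows with $N$ in this idealized setting. Real agents add retrieval and generation steps that the boolean flag ignores, so the numbers are not comparable to those reported in the figures.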