The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative

Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, Huan Liu

TL;DR

MLLM societies face a covert security threat where a single compromised agent (the 'wolf') can indirectly induce widespread malicious outputs by generating prompts that jailbreak other agents. The authors formalize a multimodal, adversarial attack using image/audio perturbations and prompt generation, optimized via Projected Gradient Descent to deceive sheep agents, and demonstrate near-perfect jailbreak success on open-source systems with transferability across agents. This work reveals a systemic, network-level risk in collaborative AI ecosystems and motivates the development of safeguards for inter-agent communications and prompt governance. The findings have practical implications for deploying MLLM networks safely in real-world, multimodal settings.

Abstract

Owing to their unprecedented ability to process and respond to various types of data, Multimodal Large Language Models (MLLMs) are constantly redefining the boundary of Artificial General Intelligence (AGI). As these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems become crucial. Our paper, "The Wolf Within", explores a novel vulnerability in MLLM societies: the indirect propagation of malicious content. Rather than eliciting harmful output directly from an MLLM, we demonstrate how a single MLLM agent can be subtly influenced to generate prompts that, in turn, induce other MLLM agents in the society to output malicious content. Our findings reveal that an MLLM agent, when manipulated to produce specific prompts or instructions, can effectively "infect" other agents within a society of MLLMs. This infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. We also show that these indirectly generated prompts are transferable, highlighting their potential to propagate malice through inter-agent communication. This research provides critical insight into a new dimension of threat posed by MLLMs, in which a single agent can act as a catalyst for widespread malevolent influence. Our work underscores the urgent need for robust mechanisms to detect and mitigate such covert manipulations within MLLM societies, ensuring their safe and ethical use in societal applications.

Paper Structure (14 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: An illustration of the proposed malice injection, where a "wolf" agent is subtly influenced to generate prompts that, in turn, induce and infect other "sheep" agents in the society to output malicious content.
  • Figure 2: Illustration of the proposed attack mechanism. Adversarial noise is injected into the image input of the wolf agent $\theta$, which then generates malicious tokens and forwards the perturbed image to the sheep agent $\phi$. The generated output is compared with the target dangerous response, and the noise is optimized iteratively.
  • Figure 3: Illustration of several case studies for image and audio injections. The two image and audio samples on the left are examples of the original inputs and their injected counterparts. Cases (a)-(f) demonstrate examples under six different prohibited scenarios. In each case, the first line shows the benign prompt. The second line shows the prompt generated by the "wolf" agent, which is not comprehensible to humans but can induce "sheep" agents to generate malicious content, as shown in the third line.
  • Figure 4: Exploring the transferability of multi-modal attacks. This figure illustrates the effectiveness of a generic textual prompt, created by the "wolf" agent, in conjunction with various adversarial triggers - either images or audio. Our findings highlight the compositional nature of these attacks, enabling the seamless propagation of malicious intents among "sheep" agents through diverse multi-modal interactions.
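
The TL;DR and the Figure 2 caption describe an iterative optimization of bounded input noise via Projected Gradient Descent (PGD). As a rough illustration of that generic optimization loop only, the sketch below runs targeted PGD against a tiny random linear classifier. This is not the authors' implementation: the toy model stands in for the agents, and every name and value here (`target_loss_and_grad`, `pgd_targeted`, `eps`, `alpha`, `steps`, the 8-dimensional input) is a hypothetical choice made for the example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_loss_and_grad(weights, x, target):
    """Cross-entropy toward `target` for a linear model, and its gradient w.r.t. x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(logit_k) = p_k - 1[k == target]; chain through the linear layer.
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]
    return loss, grad


def pgd_targeted(weights, x, target, eps=0.05, alpha=0.01, steps=40):
    """Return the lowest-loss perturbation found, with |delta_j| <= eps."""
    delta = [0.0] * len(x)
    best_delta = list(delta)
    best_loss, _ = target_loss_and_grad(weights, x, target)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, grad = target_loss_and_grad(weights, x_adv, target)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            d = delta[j] - alpha * sign            # signed descent step
            d = max(-eps, min(eps, d))             # project onto the L-inf ball
            d = max(-x[j], min(1.0 - x[j], d))     # keep x + delta inside [0, 1]
            delta[j] = d
        x_adv = [xi + di for xi, di in zip(x, delta)]
        loss, _ = target_loss_and_grad(weights, x_adv, target)
        if loss < best_loss:                       # remember the best iterate
            best_loss, best_delta = loss, list(delta)
    return best_delta, best_loss


# Toy setup (assumed): 3 classes, 8 input features in [0, 1], fixed seed.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
x = [rng.random() for _ in range(8)]

clean_loss, _ = target_loss_and_grad(weights, x, target=2)
delta, adv_loss = pgd_targeted(weights, x, target=2)
print("loss toward target before:", clean_loss, "after:", adv_loss)
```

The two clipping steps are what make the descent "projected": each update is pulled back into the budget `eps` and into the valid input range. Returning the best iterate rather than the last one is a design choice of this sketch, so the reported loss never exceeds the unperturbed loss.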