Multi-Agent Systems Execute Arbitrary Malicious Code

Harold Triedman, Rishi Jha, Vitaly Shmatikov

TL;DR

The paper demonstrates that multi-agent systems powered by LLMs are vulnerable to control-flow hijacking via adversarial, metadata-bearing content, enabling arbitrary code execution and data exfiltration. It introduces MAS hijacking as a distinct class of attacks that launders malicious requests through sub-agents to bypass safety alignments, and empirically shows high attack success across open-source MAS frameworks and various orchestrator-model configurations. The results reveal that laundering through trusted agents and diverse input modalities can defeat indirect prompt injection defenses, emphasizing the need for robust trust, isolation, and security models before wide deployment. The work argues for integrating security considerations into MAS design and outlines potential defenses and research directions to mitigate such systemic risks.

Abstract

Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, and more. Using several recently proposed multi-agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user's device or exfiltration of sensitive data from the user's containerized environment. For example, when agents are instantiated with GPT-4o, Web-based attacks successfully cause the multi-agent system to execute arbitrary malicious code in 58-90% of trials (depending on the orchestrator). In some model-orchestrator configurations, the attack success rate is 100%. We also demonstrate that these attacks succeed even if individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions. We hope that these results will motivate development of trust and security models for multi-agent systems before they are widely deployed.

Paper Structure

This paper contains 42 sections, 3 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Example of a control-flow hijacking attack, showing control flows in a benign execution and a hijacked execution.
  • Figure 2: Example of a file-based MAS hijacking attack on a centrally orchestrated MAS.
  • Figure 3: A MAS hijacking attack utilizing the contents of a local file.
  • Figure 4: A multi-modal MAS hijacking attack utilizing webpages and audio content of videos. Initially, in steps 1a and 1b, the MAS interacts with a malicious website or video, which prompts it to download a malicious key.txt file in step 2. Finally, in step 3, the MAS attempts to open the downloaded file and executes it instead, similar to the local-file attack.
  • Figure 5: Multi-agent topologies.