Multi-Agent Systems Execute Arbitrary Malicious Code
Harold Triedman, Rishi Jha, Vitaly Shmatikov
TL;DR
The paper demonstrates that multi-agent systems (MAS) powered by LLMs are vulnerable to control-flow hijacking via adversarial, metadata-bearing content, enabling arbitrary code execution and data exfiltration. It introduces MAS hijacking as a distinct class of attacks that launders malicious requests through sub-agents to bypass safety alignment, and empirically shows high attack success rates across open-source MAS frameworks and various orchestrator-model configurations. The results reveal that laundering through trusted agents and diverse input modalities can defeat indirect prompt injection defenses, emphasizing the need for robust trust, isolation, and security models before wide deployment. The work argues for integrating security considerations into MAS design and outlines potential defenses and research directions to mitigate such systemic risks.
Abstract
Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, and more. Using several recently proposed multi-agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user's device or exfiltration of sensitive data from the user's containerized environment. For example, when agents are instantiated with GPT-4o, Web-based attacks successfully cause the multi-agent system to execute arbitrary malicious code in 58-90% of trials (depending on the orchestrator). In some model-orchestrator configurations, the attack success rate is 100%. We also demonstrate that these attacks succeed even if individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions. We hope that these results will motivate development of trust and security models for multi-agent systems before they are widely deployed.
