Demonstrations of Integrity Attacks in Multi-Agent Systems
Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He
TL;DR
The paper investigates integrity attacks in LLM-powered multi-agent systems (MAS), in which a malicious agent in a multi-party setting subtly biases evaluations and behaviors without degrading overall task performance. It introduces four attack archetypes (Self-Dealer, Free-Rider, Scapegoater, and Boaster) and demonstrates their effectiveness on CAMEL, AutoGen, and MetaGPT across code generation, mathematical reasoning, and knowledge tasks. Empirical results show that both end-task execution and monitor evaluation can be manipulated, and that defense prompts fail to detect the manipulation reliably. The work highlights the security implications for MAS deployments, calls for robust architectural safeguards and risk-aware monitoring, and provides a framework for evaluating and mitigating integrity threats in collaborative AI systems.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents' contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.
