PEAR: Planner-Executor Agent Robustness Benchmark

Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing

TL;DR

PEAR is presented as the first comprehensive benchmark for evaluating the security of planner–executor LLM-based multi-agent systems (MAS), measuring task utility and vulnerability within a single framework. By testing four user-task scenarios with diverse attack modalities and injection surfaces, the study reveals a consistent trade-off: stronger planner-executor configurations deliver higher task performance yet are more susceptible to adversarial prompts and inter-agent manipulation. Key findings are that memory improves planner performance, that planner-focused attacks are particularly damaging, and that injection attacks significantly raise the end-to-end attack success rate (ASR) across model families. The work offers actionable insights and a foundation for defenses that guard both external interactions and internal prompts in real-world MAS deployments.

Abstract

Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.

Paper Structure

This paper contains 32 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of PEAR. The center panel shows the planner–executor MAS alongside example user tasks. The left shows the attack tasks; red arrows mark where these attacks are injected into MAS components. The right demonstrates the evaluation metrics used in our study.
  • Figure 2: Utility comparison across different planner–executor configurations for the Claude, DeepSeek, GPT, and Gemini families. Each bar corresponds to a specific planner/executor pair, as indicated on the x-axis. Bars indicate mean utility, with error bars showing standard deviations. Detailed numerical results are reported in the utility table in the experiments appendix.
  • Figure 3: ASR versus utility under harmful and privacy tasks, with error bars for utility (horizontal) and ASR (vertical).
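
The quantities plotted in Figures 2 and 3 (mean utility with standard-deviation error bars, and ASR) can be illustrated with a minimal sketch. This is not the benchmark's code: the function names, the per-run score lists, and the configuration labels below are hypothetical, and the use of the sample standard deviation and of ASR as a plain success fraction are assumptions rather than details confirmed by the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-run scores.

    Assumption: the error bars are sample standard deviations over
    repeated runs; the paper may use a different convention.
    """
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


def attack_success_rate(outcomes):
    """Fraction of attacked episodes in which the attack succeeded.

    Assumption: each outcome is a boolean flag for one episode.
    """
    return sum(outcomes) / len(outcomes)


# Hypothetical per-run utility scores, keyed by (planner, executor) labels.
utility_runs = {
    ("strong-planner", "weak-executor"): [0.80, 0.70, 0.90],
    ("weak-planner", "strong-executor"): [0.50, 0.40, 0.60],
}

for (planner, executor), scores in utility_runs.items():
    avg, spread = summarize(scores)
    print(f"{planner} / {executor}: utility = {avg:.2f} +/- {spread:.2f}")

# Hypothetical attack outcomes for one configuration: 3 of 4 succeed.
print(f"ASR = {attack_success_rate([True, False, True, True]):.2f}")
```

Each point in an ASR-versus-utility plot would then pair one configuration's utility summary with its ASR summary computed the same way over repeated runs.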