Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Jonathan Nöther, Adish Singla, Goran Radanovic

TL;DR

This work introduces BAD-ACTS, a multi-environment benchmark and harm taxonomy to assess the robustness of LLM-based agentic systems against adversarial manipulation. It demonstrates that a single adversarial agent can robustly influence other agents to execute harmful target actions, with larger models often more vulnerable and centralized/hierarchical structures offering partial safety. Two baseline defenses—prompt engineering and Guardian Agents with message monitoring—are evaluated, with Guardian Agents delivering more consistent reductions in attack success while preserving normal operation. The study highlights key gaps in current safety training for agentic systems and outlines concrete avenues for expanding benchmarks, exploring diverse attack modalities, and integrating more robust, scalable defenses. The benchmark and results provide a practical platform for security research in agentic AI and a call for broader collaboration to harden multi-agent AI deployments.

Abstract

Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on its security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for security research on agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS.

Paper Structure

This paper contains 37 sections, 3 figures, 14 tables.

Figures (3)

  • Figure 1: Illustration of our proposed threat setting. An adversary is able to fully control one of the agents in the system, and aims to manipulate the other agents into performing a specific target action.
  • Figure 2: Illustrations of the four environments. Arrows $A\rightarrow B$ indicate the ability of agent $A$ to send messages to agent $B$.
  • Figure 3: Breakdown of the dataset by environment and sub-category. More examples are assigned to categories with greater diversity, such as Misinformation and Malicious Files. All environments have roughly the same number of malicious actions, with the exception of Multi-Agent Debate, where the lack of tools limits the number of possible harmful actions.
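
The communication structure in Figure 2 (an arrow from A to B means A may message B) and the idea of monitoring messages before delivery can be pictured with a small sketch. This is only an illustration under stated assumptions, not the benchmark's implementation: the agent names, the `AgentSystem` class, and the keyword-based check are all hypothetical, and a practical monitor would more plausibly use a model than a phrase list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class AgentSystem:
    """Toy multi-agent system: a directed message graph plus a message monitor."""

    # edges[a] holds every agent b that a may message (arrow a -> b).
    edges: Dict[str, Set[str]]
    # Hypothetical phrases the monitor rejects; a stand-in for a real guardian.
    blocked_phrases: Tuple[str, ...] = ("delete all files",)
    # Log of (sender, receiver, text) for messages that were delivered.
    delivered: List[Tuple[str, str, str]] = field(default_factory=list)

    def can_send(self, sender: str, receiver: str) -> bool:
        """True if the communication structure has an arrow sender -> receiver."""
        return receiver in self.edges.get(sender, set())

    def is_flagged(self, text: str) -> bool:
        """Case-insensitive phrase match used as a placeholder harm check."""
        lowered = text.lower()
        return any(phrase in lowered for phrase in self.blocked_phrases)

    def send(self, sender: str, receiver: str, text: str) -> bool:
        """Deliver a message only if the edge exists and the monitor allows it."""
        if not self.can_send(sender, receiver):
            return False
        if self.is_flagged(text):
            return False
        self.delivered.append((sender, receiver, text))
        return True


# Hypothetical three-agent structure; names are illustrative only.
system = AgentSystem(
    edges={
        "planner": {"coder", "reviewer"},
        "coder": {"reviewer"},
        "reviewer": {"planner"},
    }
)
```

In this sketch, an adversary controlling `coder` can only reach `reviewer` directly, and any message it sends passes through `is_flagged` first. Restricting the edges and screening messages are separate controls, which is why the two checks are kept apart in `send`.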