Table of Contents
Fetching ...

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr

TL;DR

AgentDojo introduces a dynamic, extensible benchmark for evaluating prompt injection attacks and defenses in tool-using LLM agents. By assembling 97 realistic tasks and 629 security test cases across four stateful environments, it measures both task utility and adversarial robustness under adaptive attacks. The framework supports modular agent pipelines and diverse defenses, revealing substantial yet nuanced security-utility tradeoffs and highlighting the need for stronger isolation mechanisms and adaptive attack evaluation. The authors release the code and document AgentDojo as a live benchmark to track progress in safe, robust AI agent design.

Abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner.. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

TL;DR

AgentDojo introduces a dynamic, extensible benchmark for evaluating prompt injection attacks and defenses in tool-using LLM agents. By assembling 97 realistic tasks and 629 security test cases across four stateful environments, it measures both task utility and adversarial robustness under adaptive attacks. The framework supports modular agent pipelines and diverse defenses, revealing substantial yet nuanced security-utility tradeoffs and highlighting the need for stronger isolation mechanisms and adaptive attack evaluation. The authors release the code and document AgentDojo as a live benchmark to track progress in safe, robust AI agent design.

Abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner.. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.
Paper Structure (70 sections, 21 figures, 5 tables)

This paper contains 70 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: AgentDojo evaluates the utility and security of AI agents in dynamic tool-calling environments with untrusted data. Researchers can define user and attacker goals to evaluate the progress of AI agents, prompt injections attacks, and defenses.
  • Figure 2: AgentDojo is challenging. Our tasks are harder than the Berkeley Tool Calling Leaderboard yan2024fcleaderboard in benign settings; attacks further increase difficulty.
  • Figure 3: A stateful environment. The state tracks an email inbox, a calendar and a cloud drive.
  • Figure 4: A tool definition. This tool returns appointments by querying the calendar state.
  • Figure 5: A user task definition. This task instructs the agent to summarize calendar appointments.
  • ...and 16 more figures