Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents

Sai Puppala, Ismail Hossain, Md Jahangir Alam, Yoonpyo Lee, Jay Yoo, Tanzim Ahad, Syed Bahauddin Alam, Sajedul Talukder

TL;DR

The paper addresses the risk of safety failures in deep research agents that persist across turns and interact with tools and memory. It introduces AgentFence, an architecture-centric evaluation with a taxonomy of 14 attack classes and trace-auditable conversation breaks to measure trajectory-level security breaches while keeping the base model fixed. Empirical results across eight agent archetypes show substantial architectural differences in mean security break rate, with MSBR ranging from $0.29 \pm 0.04$ to $0.51 \pm 0.07$, and dominant failures arising from boundary and authority violations (e.g., Denial-of-Wallet, Authorization Confusion, Retrieval Poisoning, Planning Manipulation). The findings highlight that many operational vulnerabilities stem from trust-boundary design choices rather than prompt content, underscoring the need for architecture-aware security evaluations and reproducible artifact distributions to guide safer autonomous AI systems. AgentFence thus offers a principled, reusable framework for comparing architectures and diagnosing structural risks before large-scale deployment, with practical implications for enforcing correct goal framing, authority boundaries, and robust state management in deep agents.

Abstract

Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
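To make the idea of a trace-auditable conversation break concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a trace is a list of tool calls, flags a conversation as broken if any call uses a tool outside an allow-list or acts for an unexpected principal, and averages per-class break rates into an MSBR-style figure. The names `ToolCall`, `audit_trace`, `break_rate`, and `mean_break_rate`, the two break labels, and the choice of sample standard deviation for the spread are all hypothetical; the abstract does not specify how the $\pm$ values are computed.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded in a conversation trace (hypothetical schema)."""
    turn: int
    principal: str  # who the agent was acting on behalf of
    tool: str


def audit_trace(trace, allowed_tools, expected_principal):
    """Return (turn, label) pairs for every break found in one trace."""
    breaks = []
    for call in trace:
        if call.tool not in allowed_tools:
            breaks.append((call.turn, "unauthorized_tool"))
        if call.principal != expected_principal:
            breaks.append((call.turn, "wrong_principal"))
    return breaks


def break_rate(traces, allowed_tools, expected_principal):
    """Fraction of conversations containing at least one break."""
    flagged = [bool(audit_trace(t, allowed_tools, expected_principal)) for t in traces]
    return sum(flagged) / len(flagged)


def mean_break_rate(rates_by_class):
    """Mean and sample standard deviation over per-attack-class break rates.

    Assumption: the spread is taken across attack classes; the source does
    not say what the reported +/- values measure.
    """
    values = list(rates_by_class.values())
    return mean(values), stdev(values)


# Toy data, invented for illustration only.
traces = [
    [ToolCall(1, "user", "search"), ToolCall(2, "user", "summarize")],
    [ToolCall(1, "user", "search"), ToolCall(2, "user", "shell")],
    [ToolCall(1, "user", "search"), ToolCall(3, "attacker", "search")],
    [ToolCall(1, "user", "search")],
]
rate = break_rate(traces, allowed_tools={"search", "summarize"}, expected_principal="user")
msbr, spread = mean_break_rate({"class_a": 0.5, "class_b": 0.25, "class_c": 0.75})
```

Because every break is tied to a specific turn and tool call, a flagged conversation can be traced back to the exact action that violated the allow-list or the principal check, which is the property the term "trace-auditable" suggests.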

Paper Structure (38 sections, 1 equation, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Deep Research Agent Workflow Illustration.
  • Figure 2: Primary attack surfaces across the phases of the Deep Research agent lifecycle, from Context and Planning, through Action/Tool use, to Synthesis. The figure shows where adversaries most effectively apply pressure at each stage and which phases are most exposed to specific classes of attacks. The phase-by-phase mapping and definitions for every numbered item in the figure are provided in Table \ref{tab:attack_sources}.
  • Figure 3: Security break rate by attack class across Deep Research agent archetypes. Radial distance from the center represents conversation depth (turns), and each circle marks the observed vulnerability level for an agent. Results are aggregated over 91 conversation threads; denser clusters of bubbles indicate repeated occurrences of the same event.
  • Figure 4: Break-type composition aggregated across deep research agents, showing the distribution of security break types across different attack classes.