Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents
Sai Puppala, Ismail Hossain, Md Jahangir Alam, Yoonpyo Lee, Jay Yoo, Tanzim Ahad, Syed Bahauddin Alam, Sajedul Talukder
TL;DR
Deep research agents persist across turns and interact with tools and memory, so their safety failures show up in trajectories rather than in single responses. The paper introduces AgentFence, an architecture-centric evaluation that defines a taxonomy of 14 attack classes and uses trace-auditable conversation breaks to measure trajectory-level security breaches while holding the base model fixed. Across eight agent archetypes, the mean security break rate (MSBR) varies substantially with architecture, ranging from $0.29 \pm 0.04$ to $0.51 \pm 0.07$, and the dominant failures arise from boundary and authority violations (e.g., Denial-of-Wallet, Authorization Confusion, Retrieval Poisoning, Planning Manipulation). The findings indicate that many operational vulnerabilities stem from trust-boundary design choices rather than prompt content, underscoring the need for architecture-aware security evaluations and reproducible artifact distributions to guide safer autonomous AI systems. AgentFence thus offers a reusable framework for comparing architectures and diagnosing structural risks before large-scale deployment, with practical implications for enforcing correct goal framing, authority boundaries, and robust state management in deep agents.
Abstract
Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
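As a rough illustration of how a break-rate metric of this kind could be tallied from audited traces, the sketch below computes per-class break rates and averages them per archetype. The abstract does not define how MSBR is aggregated, so the averaging over attack classes, the record layout `(archetype, attack_class, broke)`, and the function names `break_rates` and `msbr` are all assumptions made for this example and not the authors' implementation.

```python
from collections import defaultdict


def break_rates(records):
    """Return {archetype: {attack_class: fraction of conversations with a break}}.

    `records` is a hypothetical iterable of (archetype, attack_class, broke)
    tuples, one per audited conversation; `broke` is True if the trace audit
    flagged at least one conversation break.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [breaks, total]
    for archetype, attack_class, broke in records:
        cell = counts[archetype][attack_class]
        cell[0] += int(broke)
        cell[1] += 1
    return {
        archetype: {cls: b / n for cls, (b, n) in per_class.items()}
        for archetype, per_class in counts.items()
    }


def msbr(records):
    """Assumed aggregation: unweighted mean of per-class break rates per archetype."""
    return {
        archetype: sum(rates.values()) / len(rates)
        for archetype, rates in break_rates(records).items()
    }
```

Averaging per class first, then across classes, keeps an attack class with many test conversations from dominating the score; a pooled rate over all conversations would be an equally plausible reading of the metric.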
