Optimizing Agent Planning for Security and Autonomy
Aashish Kolluri, Rishi Sharma, Manuel Costa, Boris Köpf, Tobias Nießen, Mark Russinovich, Shruti Tople, Santiago Zanella-Béguelin
TL;DR
This paper addresses indirect prompt injection attacks (PIAs) on AI agents. It advocates deterministic defenses based on information-flow control (IFC), which guarantee policy-compliant actions but can hinder task performance. It introduces two autonomy metrics, HITL load and TCR@k, which quantify how much IFC and policy-aware planning reduce the need for human-in-the-loop (HITL) oversight. The authors present Prudentia, an IFC-aware planner built atop Fides. Prudentia combines policy awareness, strategic variable expansion, and endorsement- versus approval-based data handling to maximize autonomy while preserving security and task utility. Evaluations on AgentDojo and WASP show that Prudentia achieves higher autonomy than prior IFC-based defenses with comparable or better task completion, including zero HITL load for data-independent tasks. These results underscore the practical value of policy-aware planning for secure, autonomous agents.
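The TL;DR relies on IFC-based defenses that deterministically decide whether an action complies with policy. The sketch below illustrates the general idea under simplifying assumptions. Integrity is modeled as a single trusted/untrusted bit, and confidentiality as a set of principals allowed to read a value. The `Label` class, the `check_send` policy, and the `"allow"`/`"needs_approval"` outcomes are hypothetical illustrations, not the API of Fides or Prudentia.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Label:
    # Integrity (simplified assumption): True only if the value derives
    # exclusively from trusted sources such as the user's own request.
    trusted: bool
    # Confidentiality: principals allowed to read the value.
    readers: FrozenSet[str]

    def join(self, other: "Label") -> "Label":
        # Combining two values taints the result: it is trusted only if both
        # inputs are trusted, and readable only by principals who may read both.
        return Label(self.trusted and other.trusted, self.readers & other.readers)


def check_send(recipient: str, body_label: Label, recipient_label: Label) -> str:
    """Hypothetical policy for a consequential 'send message' action."""
    # Integrity check: the choice of recipient must not be influenced by
    # untrusted data (e.g. an address injected via a web page).
    if not recipient_label.trusted:
        return "needs_approval"
    # Confidentiality check: the recipient must be allowed to read the body.
    if recipient not in body_label.readers:
        return "needs_approval"
    # Both checks pass, so the action can run without human approval.
    return "allow"
```

For example, a message body that mixes the user's request (trusted, readable by alice and bob) with web content (untrusted) becomes untrusted and readable only by alice and bob. Sending it to a user-specified bob is allowed automatically, while sending it to an address taken from the web is escalated to a human. The policy is deterministic, so an injected instruction cannot talk its way past it.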
Abstract
Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility.
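The abstract defines autonomy as the fraction of consequential actions an agent can execute without HITL approval while preserving security. A minimal sketch of that ratio follows. It assumes each consequential action has already been classified by a deterministic policy check as `"allow"` (executed autonomously) or some other outcome (requires approval). The function name, the decision strings, and the convention of returning 1.0 when there are no consequential actions are illustrative assumptions, not the paper's exact metric definitions such as HITL load or TCR@k.

```python
from typing import Sequence


def autonomy(decisions: Sequence[str]) -> float:
    """Fraction of consequential actions executed without HITL approval.

    `decisions` holds one entry per consequential action (hypothetical format):
    "allow" if the policy permitted the action automatically, anything else
    if it had to be escalated to a human.
    """
    if not decisions:
        # Assumed convention: with no consequential actions, no approvals are needed.
        return 1.0
    autonomous = sum(1 for d in decisions if d == "allow")
    return autonomous / len(decisions)


# Three of four consequential actions pass the policy check automatically.
print(autonomy(["allow", "allow", "needs_approval", "allow"]))  # 0.75
```

An action counts as autonomous only if the deterministic policy permitted it. Raising this ratio therefore cannot come at the cost of security; it can only come from planning actions that satisfy the policy.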
