Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Jafar Isbarov, Murat Kantarcioglu
TL;DR
The paper examines the fragility of monitoring-based defenses for autonomous AI agents under indirect prompt injection (IPI). It introduces Agent-as-a-Proxy attacks, in which the agent itself is weaponized as a delivery channel that carries the attack past the monitor. It proposes Parallel-GCG, a gradient-based optimization method that crafts adversarial strings robust across multiple reasoning and execution contexts, and demonstrates high attack success rates against multiple monitoring paradigms on the AgentDojo benchmark. The results show that even frontier-scale monitors can be bypassed, and that scaling up monitoring models does not by itself ensure security, particularly under adaptive, multi-site injections. The work argues for architecture-level security guarantees rather than reliance on model scale or runtime oversight alone, and highlights the urgent need for robust defenses against adaptive IPI in agentic systems.
Abstract
As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
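To make the delivery-channel idea concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a monitor that receives the user task together with the agent's reasoning steps and tool calls, and an agent that quotes what it reads from a tool. Every name here (`Trace`, `toy_agent`, `build_monitor_input`, the `send_email` action, the placeholder adversarial string) is hypothetical and chosen only for illustration; no real agent, monitor, or optimization is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of what a monitor gets to inspect."""
    user_task: str
    reasoning: list = field(default_factory=list)   # chain-of-thought steps
    tool_calls: list = field(default_factory=list)  # tool-use actions


def toy_agent(user_task: str, tool_output: str) -> Trace:
    """Stand-in for an LLM agent (assumption: it quotes tool output verbatim)."""
    trace = Trace(user_task=user_task)
    # The agent restates the tool output in its reasoning ...
    trace.reasoning.append(f"The tool returned: {tool_output}")
    # ... and passes it along in a follow-up action.
    trace.tool_calls.append(f"send_email(body={tool_output})")
    return trace


def build_monitor_input(trace: Trace) -> str:
    """Assemble the text a CoT-plus-action monitor would be asked to judge."""
    parts = [f"USER TASK: {trace.user_task}"]
    parts += [f"THOUGHT: {step}" for step in trace.reasoning]
    parts += [f"ACTION: {call}" for call in trace.tool_calls]
    return "\n".join(parts)


# Placeholder for an optimized adversarial string; nothing is optimized here.
ADV = "<optimized-adversarial-string>"

# The attacker controls only the tool output (indirect prompt injection).
tool_output = f"Invoice total: $120. {ADV}"

trace = toy_agent("Summarize my latest invoice", tool_output)
monitor_input = build_monitor_input(trace)

# The attacker never addresses the monitor directly, yet the string reaches
# it through the agent's own reasoning and actions.
adv_occurrences = monitor_input.count(ADV)
print(monitor_input)
```

In this toy setting the injected string appears in the monitor's input at more than one site (the reasoning step and the tool call), which is one way to picture why an adversarial string may need to remain effective across several contexts at once.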
