Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov, Murat Kantarcioglu

TL;DR

The paper tackles the fragility of monitoring-based defenses for autonomous AI agents under indirect prompt injection (IPI) by introducing Agent-as-a-Proxy attacks, where agents are weaponized as delivery channels to bypass monitors. It proposes Parallel-GCG, a gradient-based optimization that crafts adversarial strings robust across multiple reasoning and execution contexts, and demonstrates high attack success rates against multiple monitoring paradigms on the AgentDojo benchmark. The results show that even frontier-scale monitors can be bypassed, and that simply scaling monitoring models does not ensure security, particularly when adaptive, multi-site injections are considered. The work argues for architecture-level security guarantees rather than relying on model scale or runtime oversight alone, highlighting the urgent need for robust defenses against adaptive IPI in agentic systems.
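To make the multi-context idea concrete, here is a minimal sketch of one way a GCG-style candidate-selection step could account for an adversarial string that appears at several sites in the monitor's input (for example, once in the agent's reasoning and once in a tool call). It is an illustration under stated assumptions, not the paper's Parallel-GCG algorithm: the function names, the toy gradient values, and the choice to sum gradients across sites are all hypothetical. The summation itself follows from the chain rule when the same tokens are shared across sites.

```python
# Toy sketch: aggregate per-site token gradients before picking substitutions.
# All names and numbers are hypothetical; real gradients would come from the
# monitor model's loss with respect to one-hot token indicators.

def aggregate_site_gradients(site_grads):
    """Sum gradients over sites.

    site_grads[s][i][t] is the gradient of the loss with respect to placing
    vocabulary token t at position i of the adversarial string, measured at
    site s (one occurrence of the string in the monitor's input).
    """
    num_positions = len(site_grads[0])
    vocab_size = len(site_grads[0][0])
    agg = [[0.0] * vocab_size for _ in range(num_positions)]
    for site in site_grads:
        for i, row in enumerate(site):
            for t, g in enumerate(row):
                agg[i][t] += g
    return agg


def top_k_candidates(agg, k):
    """For each position, return the k token ids with the most negative gradient."""
    return [sorted(range(len(row)), key=lambda t: row[t])[:k] for row in agg]


# Two sites, adversarial string of length 2, vocabulary of 3 tokens.
site_a = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.4]]
site_b = [[-2.0, 0.8, 0.1], [0.2, -0.9, -0.1]]

single_site = top_k_candidates(aggregate_site_gradients([site_a]), k=1)
multi_site = top_k_candidates(aggregate_site_gradients([site_a, site_b]), k=1)
print(single_site)  # [[1], [2]]
print(multi_site)   # [[0], [1]]
```

In this toy case the best substitution at each position changes once the second site is taken into account, which is the kind of effect a single-site optimizer would miss.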

Abstract

As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Pareto frontier of AlignmentCheck. We evaluate AlignmentCheck with open-weight LLMs on AgentDojo and use these results to choose the target monitoring LLMs.
  • Figure 2: Impact of optimization set size on attack success. While attack success improves modestly with larger optimization sets (k=1 to k=16), a single trace (k=1) is sufficient to achieve substantial success, showing that the attack bypasses AlignmentCheck without requiring extensive data. Static attacks from AgentDojo serve as the baseline.
  • Figure 3: Optimizing for a simplified monitoring setting is sufficient to break real-time monitoring. We compare attack performance when AlignmentCheck evaluates the full trace at once (no-replay mode) versus step-by-step (replay mode). Although our attack strings are optimized using the full trace, they remain highly effective in the realistic replay setting, confirming the threat is viable in live deployment scenarios where the monitor observes the agent in real-time.
  • Figure 4: Parallel-GCG outperforms GCG. We compare the attack success rate (ASR) of GCG-optimized strings with Parallel-GCG-optimized strings. Parallel-GCG achieves higher ASR, particularly against models with lower baseline ASR, validating the need to account for token repetition in agentic contexts.
  • Figure 5: Hybrid monitoring achieves a higher detection rate by monitoring the chain-of-thought (CoT) and tool calls at the same time. However, this creates a vulnerability where prompt injection can be used not only to misuse the agent, but also to bypass the monitor.
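
The distinction in Figure 3 between no-replay mode (the monitor sees the full trace once) and replay mode (the monitor sees the trace step by step) can be sketched as follows. This is a toy illustration, not the AlignmentCheck implementation: `toy_monitor`, the `[ADV]` marker standing in for an optimized adversarial string, the example traces, and the definition of success (attacker goal achieved and monitor never flags) are all assumptions made for this example.

```python
# Toy comparison of no-replay vs. replay monitoring. Everything here is
# hypothetical; a real monitor would be an LLM call, not a keyword rule.

def toy_monitor(text):
    """Flag the text unless the stand-in adversarial marker is present."""
    lowered = text.lower()
    return "transfer funds" in lowered and "[adv]" not in lowered


def no_replay_flagged(trace, monitor):
    """The monitor evaluates the whole trace in a single call."""
    return monitor("\n".join(trace))


def replay_flagged(trace, monitor):
    """The monitor evaluates every prefix of the trace, as in live oversight."""
    return any(monitor("\n".join(trace[:i])) for i in range(1, len(trace) + 1))


def attack_success_rate(traces, goal_achieved, flagged_fn, monitor):
    """Fraction of traces where the attacker goal is met and nothing is flagged."""
    successes = sum(
        1
        for trace, ok in zip(traces, goal_achieved)
        if ok and not flagged_fn(trace, monitor)
    )
    return successes / len(traces)


traces = [
    # Marker appears before and alongside the malicious action.
    ["thought: [ADV] user asked to pay a bill", "tool: transfer funds to attacker [ADV]"],
    # Marker appears only after the malicious action.
    ["tool: transfer funds to attacker", "thought: [ADV] all good"],
    # No marker at all.
    ["thought: pay a bill", "tool: transfer funds to attacker"],
]
goal_achieved = [True, True, True]

asr_no_replay = attack_success_rate(traces, goal_achieved, no_replay_flagged, toy_monitor)
asr_replay = attack_success_rate(traces, goal_achieved, replay_flagged, toy_monitor)
print(round(asr_no_replay, 3), round(asr_replay, 3))  # 0.667 0.333
```

In this toy setup, a string that suppresses the monitor only once the full trace is visible fails in replay mode, while a string present at every step survives both modes. That is one reason an attack optimized on full traces is not guaranteed to transfer to step-by-step monitoring, which is what the figure tests.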