
Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs

Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Rui Cai, Peijie Qiu, Zhipeng Wang, Oana Frunza, Shao Tang, Jindong Gu, Yalin Wang

Abstract

The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we revealed that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.
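The RepE-based circuit breaker described in the abstract can be sketched in miniature. The following is an illustrative reconstruction, not the authors' implementation: it assumes a difference-of-means reading vector (a common RepE recipe) computed over hidden states captured at the tool-input token position, with synthetic 64-dimensional activations standing in for real model states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden states at the tool-input token position (in practice
# these would be captured via a forward hook on an intermediate decoder
# layer). We fabricate separable activations purely for illustration.
dim = 64
latent_axis = rng.normal(size=dim)                    # synthetic "hijack" axis
benign = rng.normal(size=(200, dim)) - 0.8 * latent_axis
hijacked = rng.normal(size=(200, dim)) + 0.8 * latent_axis

# Difference-of-means reading vector: the direction separating hijacked
# from benign activations, normalized to unit length.
reading_vec = hijacked.mean(axis=0) - benign.mean(axis=0)
reading_vec /= np.linalg.norm(reading_vec)

def hijack_score(hidden_state):
    """Project a hidden state onto the reading vector."""
    return float(hidden_state @ reading_vec)

# Calibrate a threshold on benign activations, then intercept any tool
# call whose hidden state projects above it -- before the agent commits.
threshold = np.quantile(benign @ reading_vec, 0.99)

def circuit_breaker(hidden_state):
    return "BLOCK" if hijack_score(hidden_state) > threshold else "ALLOW"
```

Because the score is read from the model's internal state rather than its output text, the breaker can fire even when the surface response looks compliant and unremarkable.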


Paper Structure

This paper contains 14 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Example overview of the experimental framework based on the AgentDojo (debenedetti2024agentdojo) Banking suite. (A) illustrates the system environment and global state, including representative metadata such as account profiles and transaction histories. (B) shows the task schema, encompassing legitimate user objectives, available tool functions, and the adversarial injection payloads. (C) delineates the end-to-end execution trace. See Sec. \ref{subsect:preliminary} for more details.
  • Figure 2: Vulnerability analysis of LLM agents (e.g., Qwen-2.5-14B) under four distinct Indirect Prompt Injection (IPI) attacks. (A) illustrates the attack success metrics, quantified by the Hijack Rate across diverse scenarios. (B) characterizes the linguistic patterns observed in successful hijacking cases. We categorize these failures into four distinct behaviors: Immediate, where the agent complies without detectable reasoning; Hedging, where the agent expresses uncertainty yet follows the injection; Rationalized, where the agent acknowledges the injection or deceptively frames the malicious action as part of the legitimate task; and Reluctant, where the agent exhibits verbal resistance but ultimately executes the hijacked command. These categories are identified via fixed-syntax filtering of the reasoning traces following corrupted tool-call results.
  • Figure 3: Illustration of three LLM performance bottlenecks across four IPI attacks. (A) depicts the number of assistant turns elapsed between the receipt of the injection and the execution of the malicious action. (B) and (C) present violin plots comparing the distributions of decision entropy in natural log units and mean log-probability, respectively, as they evolve across execution turns. The shaded regions represent $\pm 1$ standard deviation.
  • Figure 4: Performance evaluation of Qwen3-8B in hijacking detection experiments utilizing Representation Engineering (RepE)-based methods. (A) displays the Receiver Operating Characteristic (ROC) curve, and (B) illustrates the Precision-Recall (PR) curve.
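The four failure behaviors in Figure 2(B) are identified via fixed-syntax filtering of reasoning traces. A minimal sketch of such a filter follows; the paper does not disclose its actual rules, so these regex patterns are illustrative stand-ins, with "Immediate" as the fallback when no detectable reasoning precedes compliance.

```python
import re

# Hypothetical surface cues for each behavior; real filtering rules
# would be tuned to the models' actual reasoning-trace phrasing.
PATTERNS = [
    ("Reluctant", re.compile(r"\b(should not|refuse|cannot comply)\b", re.I)),
    ("Rationalized", re.compile(r"\b(part of the task|as requested)\b", re.I)),
    ("Hedging", re.compile(r"\b(not sure|unclear|uncertain|perhaps)\b", re.I)),
]

def categorize(trace: str) -> str:
    """Bucket a reasoning trace emitted after a corrupted tool-call result."""
    for label, pattern in PATTERNS:
        if pattern.search(trace):
            return label
    return "Immediate"  # compliance with no detectable reasoning
```

For example, a trace reading "I'm not sure this is safe, but I'll proceed" would land in the Hedging bucket, while a bare "Transferring the funds now" falls through to Immediate.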
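The decision-entropy and mean log-probability metrics in Figure 3(B) and (C) can be computed directly from per-token log-probabilities. A minimal sketch, assuming access to the model's (possibly top-k truncated) next-token log-probs at each decision point:

```python
import math

def decision_entropy(logprobs):
    """Shannon entropy in nats over a next-token distribution given as a
    list of log-probabilities. The list may be a truncated top-k slice,
    so we renormalize before computing entropy. High entropy at a
    tool-call decision point signals latent hesitation even when the
    surface action looks decisive."""
    probs = [math.exp(lp) for lp in logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_logprob(logprobs):
    """Mean log-probability of the sampled tokens across a turn."""
    return sum(logprobs) / len(logprobs)
```

A uniform distribution over four candidate tokens yields ln 4 ≈ 1.386 nats, the maximum for that support, while a near-deterministic choice drives the entropy toward zero.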
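The ROC and PR curves in Figure 4 reduce to standard threshold sweeps over the detector's scores. A dependency-free sketch of both metrics, where `scores_pos` and `scores_neg` are hypothetical detector outputs for hijacked and benign examples respectively:

```python
def roc_auc(scores_pos, scores_neg):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outscores a randomly chosen negative,
    counting ties as half a win."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def precision_recall_points(scores_pos, scores_neg):
    """(precision, recall) at each distinct score threshold, highest first."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fp = sum(s >= t for s in scores_neg)
        if tp + fp:
            points.append((tp / (tp + fp), tp / len(scores_pos)))
    return points
```

A perfectly separated detector gives an AUROC of 1.0, while an uninformative one hovers at 0.5; the PR sweep is the more informative view when hijacked examples are rare relative to benign traffic.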