Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh

TL;DR

Indirect prompt injection poses a critical risk for LLM-powered agents. The authors propose IntentGuard, a general defense that uses an instruction-following intent analyzer (IIA) to reveal which instructions the model intends to follow and whether they originate from untrusted data, so that the system can either alert the user or mask the suspicious content. They instantiate the IIA with three thinking-intervention strategies that elicit a structured list of intended instructions from reasoning-enabled LLMs, implemented as a single-pass, model-internal defense. Results on AgentDojo and Mind2Web with Qwen-3-32B and gpt-oss-20B show no utility degradation in all but one setting and strong robustness against adaptive attacks (e.g., the attack success rate drops from 100% to 8.5% in a Mind2Web scenario).

Abstract

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).
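The overlap check described in the abstract (intended instructions versus untrusted data segments) can be illustrated with a minimal sketch. It assumes the IIA has already produced a list of intended instructions as plain strings, and it uses exact word-level sliding-window matching; the `Segment` type, the normalization rule, and the `[MASKED]` placeholder are hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    trusted: bool  # assumption: False marks tool responses / external data


def normalize(text: str) -> list[str]:
    # Hypothetical normalization: whitespace-split, lowercase, trim punctuation.
    return [w.strip(".,;:!?\"'").lower() for w in text.split()]


def find_window(haystack: list[str], needle: list[str]) -> int:
    """Slide a window of len(needle) over haystack; return match start or -1."""
    n = len(needle)
    if n == 0:
        return -1
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def trace_origins(intended: list[str], segments: list[Segment]) -> list[tuple[int, int, int]]:
    """Return (segment index, start word, length) for instructions found in untrusted segments."""
    hits = []
    for instruction in intended:
        needle = normalize(instruction)
        for idx, seg in enumerate(segments):
            if seg.trusted:
                continue  # instructions from trusted segments are allowed
            start = find_window(normalize(seg.text), needle)
            if start >= 0:
                hits.append((idx, start, len(needle)))
    return hits


def mask_hits(segments: list[Segment], hits: list[tuple[int, int, int]],
              placeholder: str = "[MASKED]") -> list[Segment]:
    """Replace each matched word span with a placeholder (masking-style mitigation)."""
    masked = [Segment(s.text, s.trusted) for s in segments]
    # Process right-to-left so earlier word indices stay valid after replacement.
    for idx, start, length in sorted(hits, key=lambda h: h[1], reverse=True):
        words = masked[idx].text.split()
        words[start:start + length] = [placeholder]
        masked[idx].text = " ".join(words)
    return masked
```

In this sketch, a non-empty result from `trace_origins` is the signal that would trigger either an alert to the user or a call to `mask_hits` followed by regeneration. Exact matching is a simplification: if the model paraphrases an injected instruction, a fuzzy or embedding-based comparison would be needed instead.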

Paper Structure

This paper contains 15 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: IntentGuard consists of three steps: (1) Intent Extraction: use an instruction-following intent analyzer (IIA) to extract a list of instructions the LLM intends to follow. (2) Origin Tracing: trace each instruction back to its origin via sliding window matching. (3) Injection Mitigation: if any instruction originates from an untrusted data segment (tool response), either alert the user for confirmation (alert mode) or mask out the suspicious region and regenerate (recovery mode).
  • Figure 2: Building IIA with Thinking Intervention. In the figure, blue bold text indicates Thinking Intervention content, while black text indicates LLM-generated content. (1) Start-of-Thinking Prefilling: prefill the beginning of the reasoning chain to encourage the model to generate a structured list of intended instructions. (2) End-of-Thinking Refinement: upon detecting the first "</think>" token, replace it with "Now, let me refine..." to enforce refinement of the instruction list. (3) In-Context Demonstration: prepend an example reasoning trace with a structured list of instructions to guide the model toward this reasoning pattern.
  • Figure 3: IntentGuard performance with different IIA design choices (Qwen3-32B).
  • Figure 4: Confusion matrix of Qwen3-32B with IIA on AgentDojo under PAIR attack.
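The thinking interventions summarized in the Figure 2 caption can be sketched as plain string manipulation around a text-completion call. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable that returns the model's continuation of a raw text context, the prefill wording is invented for illustration, and the refinement cue is kept exactly as elided in the caption ("Now, let me refine..."). Real deployments would also need to go through the model's chat template, which is omitted here.

```python
from typing import Callable

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
# Hypothetical prefill text; the source does not give the exact wording.
PREFILL = "Let me first list the instructions I intend to follow:\n1."
# Copied from the figure caption, including its elision.
REFINE_CUE = "Now, let me refine..."


def build_prefilled_context(prompt: str, demonstration: str = "") -> str:
    """Start-of-thinking prefilling, optionally preceded by an in-context demonstration."""
    return f"{demonstration}{prompt}{THINK_OPEN}\n{PREFILL}"


def run_with_intervention(prompt: str, generate: Callable[[str], str],
                          demonstration: str = "") -> str:
    """Generate once, swap the first end-of-thinking tag for the refinement cue, then continue."""
    context = build_prefilled_context(prompt, demonstration)
    first = generate(context)
    head, sep, _ = first.partition(THINK_CLOSE)
    if not sep:
        # The model never closed its reasoning; nothing to intervene on.
        return context + first
    # End-of-thinking refinement: discard the tag and everything after it,
    # append the cue, and let the model keep reasoning from there.
    context = context + head + REFINE_CUE
    return context + generate(context)
```

Only the first `</think>` is replaced, so the second call is free to close the reasoning chain normally. In a streaming setup the same effect could be achieved by stopping generation when the tag is detected rather than discarding text after the fact.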

Theorems & Definitions (1)

  • Definition 1: instruction-following intent analyzer (IIA)