Table of Contents
Fetching ...

PromptArmor: Simple yet Effective Prompt Injection Defenses

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song

TL;DR

PromptArmor introduces a guardrail-based defense that uses an off-the-shelf LLM to detect and sanitize injected prompts before they influence LLM agents. By framing detection and extraction as a prompting task and employing fuzzy extraction, it achieves state-of-the-art false-positive/false-negative control and dramatically lowers attack success rates on AgentDojo. The approach is modular, computationally efficient, and benefits from improvements in base LLMs, with strong robustness to adaptive attacks and favorable ablations across model sizes and reasoning capabilities. This work provides a practical baseline for prompt-injection defenses and highlights the value of prompt-based guardrails in securing LLM-driven workflows.

Abstract

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

PromptArmor: Simple yet Effective Prompt Injection Defenses

TL;DR

PromptArmor introduces a guardrail-based defense that uses an off-the-shelf LLM to detect and sanitize injected prompts before they influence LLM agents. By framing detection and extraction as a prompting task and employing fuzzy extraction, it achieves state-of-the-art false-positive/false-negative control and dramatically lowers attack success rates on AgentDojo. The approach is modular, computationally efficient, and benefits from improvements in base LLMs, with strong robustness to adaptive attacks and favorable ablations across model sizes and reasoning capabilities. This work provides a practical baseline for prompt-injection defenses and highlights the value of prompt-based guardrails in securing LLM-driven workflows.

Abstract

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

Paper Structure

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of how PromptArmor defends against prompt injection attacks as a guardrail for LLM agents using an example from AgentDojo. As shown in the figure, the user asks the agent to make payment of the most recent transaction and the attacker injects the malicious instruction into the transaction history. Here, PromptArmor takes as input the potential input data and prompts of the LLM and flags if there is a potentially injected prompts. In this case, it is clear that the input contains two distinct instructions, which will be flagged by the PromptArmor as prompt injection. Then, PromptArmor will locate and remove the injected instruction, and the agent can continue to execute the original user instruction.
  • Figure 2: Detailed workflow of PromptArmor, which detect and remove the injected instruction from the model input.
  • Figure 3: Impact of model size and reasoning on detection performance and task utility in the Qwen3 family. Larger models (8B and 32B) achieve a better balance between security (low FPR/FNR) and utility (high UA/low ASR), with Qwen3-32B reaching near-optimal results regardless of reasoning mode. The smallest model, Qwen3-0.6B, exhibits extreme trade-offs.