Table of Contents
Fetching ...

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das, Luca Beurer-Kellner, Marc Fischer, Maximilian Baader

TL;DR

CommandSans reframes indirect prompt-injection defense as token-level instruction sanitization, removing AI-directed instructions from tool outputs rather than blocking content. By training a small encoder-based model on instruction-tuning data, it achieves non-blocking, context-agnostic protection that generalizes across diverse attacks. Evaluated on five benchmarks and via human red-teaming, it delivers substantial ASR reductions (up to 19x) with minimal utility loss, addressing practical deployment concerns such as latency and false positives. This approach offers a scalable, industry-ready pathway for securing AI agents operating with external tools.

Abstract

The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

TL;DR

CommandSans reframes indirect prompt-injection defense as token-level instruction sanitization, removing AI-directed instructions from tool outputs rather than blocking content. By training a small encoder-based model on instruction-tuning data, it achieves non-blocking, context-agnostic protection that generalizes across diverse attacks. Evaluated on five benchmarks and via human red-teaming, it delivers substantial ASR reductions (up to 19x) with minimal utility loss, addressing practical deployment concerns such as latency and false positives. This approach offers a scalable, industry-ready pathway for securing AI agents operating with external tools.

Abstract

The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.

Paper Structure

This paper contains 40 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Comparing traditional sample-level defenses with our sanitization-based approach.
  • Figure 2: Our approach consists of three stages: (1) Data curation from instruction-tuning datasets (BFCL, OpenOrca) and synthetic tool outputs, followed by LLM-based labeling to identify AI-directed instructions; (2) Training a small, fast masked language model (XLM-RoBERTa) for binary token-level classification of instruction vs. non-instruction tokens; (3) Deployment as a prompt sanitizer that removes instructions from AI agent tool outputs before they enter the LLM's context.
  • Figure 3: Annotated training sample with <instruction> tags inserted by our LLM labeler.
  • Figure 4: Security vs. Utility tradeoff under attack (GPT-4o on AgentDojo). Security = $1-ASR$.
  • Figure 4: Attack Success Rates (ASR) in % of InjecAgent Enhanced setting results on GPT-4.
  • ...and 3 more figures