Table of Contents
Fetching ...

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

Runpeng Geng, Yanting Wang, Chenlong Yin, Minhao Cheng, Ying Chen, Jinyuan Jia

TL;DR

PISanitizer targets prompt injection in long-context LLMs by sanitizing injected tokens before response generation. It leverages a purposely designed sanitization instruction to elicit instruction-following behavior in an LLM and then uses attention signals to pinpoint and remove high-influence tokens, preserving utility for the target task. The approach shows near-zero attack success under diverse injections, robust performance against adaptive attacks, and efficient runtime on long contexts, outperforming existing defenses. This work advances secure deployment of long-context LLMs and suggests directions for integration with security policies and multi-modal systems.

Abstract

Long context LLMs are vulnerable to prompt injection, where an attacker can inject an instruction in a long context to induce an LLM to generate an attacker-desired output. Existing prompt injection defenses are designed for short contexts. When extended to long-context scenarios, they have limited effectiveness. The reason is that an injected instruction constitutes only a very small portion of a long context, making the defense very challenging. In this work, we propose PISanitizer, which first pinpoints and sanitizes potential injected tokens (if any) in a context before letting a backend LLM generate a response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels an LLM to follow it, and (2) LLMs intrinsically leverage the attention mechanism to focus on crucial input tokens for output generation. Guided by these two observations, we first intentionally let an LLM follow arbitrary instructions in a context and then sanitize tokens receiving high attention that drive the instruction-following behavior of the LLM. By design, PISanitizer presents a dilemma for an attacker: the more effectively an injected instruction compels an LLM to follow it, the more likely it is to be sanitized by PISanitizer. Our extensive evaluation shows that PISanitizer can successfully prevent prompt injection, maintain utility, outperform existing defenses, is efficient, and is robust to optimization-based and strong adaptive attacks. The code is available at https://github.com/sleeepeer/PISanitizer.

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

TL;DR

PISanitizer targets prompt injection in long-context LLMs by sanitizing injected tokens before response generation. It leverages a purposely designed sanitization instruction to elicit instruction-following behavior in an LLM and then uses attention signals to pinpoint and remove high-influence tokens, preserving utility for the target task. The approach shows near-zero attack success under diverse injections, robust performance against adaptive attacks, and efficient runtime on long contexts, outperforming existing defenses. This work advances secure deployment of long-context LLMs and suggests directions for integration with security policies and multi-modal systems.

Abstract

Long context LLMs are vulnerable to prompt injection, where an attacker can inject an instruction in a long context to induce an LLM to generate an attacker-desired output. Existing prompt injection defenses are designed for short contexts. When extended to long-context scenarios, they have limited effectiveness. The reason is that an injected instruction constitutes only a very small portion of a long context, making the defense very challenging. In this work, we propose PISanitizer, which first pinpoints and sanitizes potential injected tokens (if any) in a context before letting a backend LLM generate a response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels an LLM to follow it, and (2) LLMs intrinsically leverage the attention mechanism to focus on crucial input tokens for output generation. Guided by these two observations, we first intentionally let an LLM follow arbitrary instructions in a context and then sanitize tokens receiving high attention that drive the instruction-following behavior of the LLM. By design, PISanitizer presents a dilemma for an attacker: the more effectively an injected instruction compels an LLM to follow it, the more likely it is to be sanitized by PISanitizer. Our extensive evaluation shows that PISanitizer can successfully prevent prompt injection, maintain utility, outperform existing defenses, is efficient, and is robust to optimization-based and strong adaptive attacks. The code is available at https://github.com/sleeepeer/PISanitizer.

Paper Structure

This paper contains 29 sections, 3 figures, 14 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of PISanitizer, which sanitizes a prompt before feeding it into a backend LLM.
  • Figure 2: Overview of PISanitizer for prompt sanitization of clean and contaminated contexts.
  • Figure 3: Impact of hyperparameters $\theta$, $d$, $w_s$ on PISanitizer under Combined Attack.