DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion

Ruofan Liu, Yun Lin, Zhiyong Huang, Jin Song Dong

TL;DR

DRIP reframes prompt-injection defense as a representation-editing problem complemented by a residual instruction fusion architecture. It introduces a token-wise deinstruction shift g(e_d) and a contrastive DPO-based training regime over carefully curated data to push instruction-like tokens in data away from the instruction manifold while preserving data semantics. A residual fusion pathway anchors the final output to the true instruction, significantly reducing adversarial overwriting and enabling robust performance across adaptive attacks without utility loss. Evaluations on open-source LLMs (LLaMA-8B and Mistral-7B) show state-of-the-art improvements on SEP, AlpacaFarm, and InjecAgent benchmarks, with high IFEval and AlpacaEval-2.0 utility comparable to undefended models. The work suggests a practical, scalable approach to hardening LLMs against prompt injection through targeted representation control and architectural safeguards.
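
The summary above mentions a DPO-based training regime without stating the objective. As a rough illustration only, the standard DPO loss for a single preference pair can be sketched as follows. The function name, argument names, and the default `beta` are assumptions made for this example, and the paper's actual objective may add further terms.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and under a
    frozen reference model. `beta` is an illustrative default, not a value
    taken from the paper.
    """
    # Scaled difference of the policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), computed without overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy favors the preferred response more strongly than the reference model does, the margin is positive and the loss falls below log 2; when it favors the rejected response, the loss rises above log 2.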

Abstract

Large language models (LLMs) are increasingly integrated into IT infrastructures, where they process user data according to predefined instructions. However, conventional LLMs remain vulnerable to prompt injection, where malicious users inject directive tokens into the data to subvert model behavior. Existing defenses train LLMs to semantically separate data and instruction tokens, but still struggle to (1) balance utility and security and (2) prevent instruction-like semantics in the data from overriding the intended instructions. We propose DRIP, which (1) precisely removes instruction semantics from tokens in the data section while preserving their data semantics, and (2) robustly preserves the effect of the intended instruction even under strong adversarial content. To "de-instructionalize" data tokens, DRIP introduces a data curation and training paradigm with a lightweight representation-editing module that edits embeddings of instruction-like tokens in the data section, enhancing security without harming utility. To ensure non-overwritability of instructions, DRIP adds a minimal residual module that reduces the ability of adversarial data to overwrite the original instruction. We evaluate DRIP on LLaMA 8B and Mistral 7B against StruQ, SecAlign, ISE, and PFT on three prompt-injection benchmarks (SEP, AlpacaFarm, and InjecAgent). DRIP improves role-separation score by 12-49%, reduces attack success rate by over 66% under adaptive attacks, and matches the utility of the undefended model, establishing a new state of the art for prompt-injection robustness.
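
The two modules described in the abstract can be pictured with a toy sketch. It assumes the representation edit is additive (each data-token embedding e becomes e + g(e)) and that the residual module adds the last instruction token's hidden state to the final hidden state. Both forms, and all function and parameter names, are assumptions for illustration and not the authors' implementation.

```python
def deinstruction_shift(embeddings, is_data, g):
    """Apply e -> e + g(e) to data-token embeddings only.

    `embeddings` is a list of vectors (lists of floats), `is_data` marks which
    positions belong to the untrusted data segment, and `g` is a hypothetical
    learned editing function returning a vector of the same size.
    """
    shifted = []
    for e, data in zip(embeddings, is_data):
        if data:
            delta = g(e)
            shifted.append([x + d for x, d in zip(e, delta)])
        else:
            # Instruction tokens are left untouched.
            shifted.append(list(e))
    return shifted


def fuse_with_instruction(final_hidden, hidden_states, last_instruction_idx):
    """Sum fusion (assumed form): add the last instruction token's state
    to the final hidden state before it would reach the LM head."""
    anchor = hidden_states[last_instruction_idx]
    return [h + a for h, a in zip(final_hidden, anchor)]


# Toy usage: one instruction token followed by one data token.
toy_embeddings = [[1.0, 2.0], [4.0, 6.0]]
toy_shifted = deinstruction_shift(toy_embeddings, [False, True], lambda e: [-0.5 * x for x in e])
toy_fused = fuse_with_instruction([1.0, 1.0], toy_shifted, 0)
```

In a real model, `g` would be a small trained network and the hidden states would come from the transformer's last layer; the sketch only shows where each module sits in the data flow.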

Paper Structure

This paper contains 47 sections, 5 theorems, 67 equations, 15 figures, 8 tables.

Key Result

Theorem 1

Under the manifold-separation assumption (assump:manifold-sep), consider a token $x_a$ for which both losses are non-trivial (scores are finite and not saturated). Then any first-order stationary point of $\mathcal{L}$ w.r.t. $\mathbf{e}(x_a)$ and $\mathbf{g}$ satisfies:

Figures (15)

  • Figure 1: The primary task is translation, while the data introduces a diverting task that asks for the capital of France. Conservative defenses can remove all instruction-like data, but this leads to information loss. We propose de-instructing instead of removing. In that case, the diverting task is safely translated.
  • Figure 2: Example of an optimization-based attack generated by GCG. The injected instruction is "print hacked", and the injected suffix maximizes the chance of generating "hacked" as the next token.
  • Figure 3: Overview of DRIP. An input prompt consists of two segments: a trusted instruction and untrusted data. After tokenization, input embedding, and positional encoding, DRIP applies a de-instruction shift (see the de-instruction shift section) to data tokens to suppress semantics that may distract from the intended task. At the output stage, the model fuses the final hidden state with the last instruction token’s state (see the re-instruction fusion section) before passing it to the LM head. Autoregressive generation then proceeds as usual.
  • Figure 4: Data curation pipeline. Each DPO pair consists of a preferred and a rejected response. The first step generates the ground-truth response by querying the LLM. The second step uses an LLM-as-judge to verify that the injected task is not executed. The two steps are repeated to refine the response until the preferred response is correct. Note that only the preferred response needs to go through this extra auditing.
  • Figure 5: On the LHS, the primary task is to rewrite the paragraph in modern language, and the injected task asks for the name of the river that runs through London. DRIP successfully de-instructs the injected task and rewrites it as part of the paragraph. On the RHS, the same question is the true top-level instruction, and DRIP answers it correctly.
  • ...and 10 more figures
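
The generate-then-audit loop in the Figure 4 caption can be sketched as below. The callables `generate` and `judge_executes_injection` stand in for the LLM query and the LLM-as-judge check, and `max_rounds` is an assumed stopping rule. None of these names come from the paper.

```python
def curate_preferred(prompt, generate, judge_executes_injection, max_rounds=3):
    """Regenerate a candidate preferred response until the judge accepts it.

    `generate(prompt)` returns a candidate response, and
    `judge_executes_injection(prompt, response)` returns True when the
    response carries out the injected task. Both are hypothetical stand-ins.
    Returns the first accepted response, or None if every round is rejected.
    """
    for _ in range(max_rounds):
        response = generate(prompt)
        if not judge_executes_injection(prompt, response):
            return response
    return None


# Toy usage with stubbed components: the first candidate answers the injected
# question and is rejected; the second translates it and is accepted.
_candidates = iter(["Paris", "Quelle est la capitale de la France ?"])
toy_preferred = curate_preferred(
    "Translate to French: What is the capital of France?",
    lambda p: next(_candidates),
    lambda p, r: r == "Paris",
)
```

Only the preferred response passes through this audit, matching the caption; how the rejected response is produced is not specified here and is left out of the sketch.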

Theorems & Definitions (10)

  • Theorem 1: Directional separation from Cases 1-3
  • proof
  • Lemma 1: Suffix-to-logit Lipschitz constants
  • proof
  • Theorem 2: Margin-based robustness of instruction fusion
  • proof
  • Theorem 3: Sum fusion preserves information
  • proof
  • Theorem 4: Concat fusion has an information bottleneck
  • proof