Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Yinan Zhong, Qianhao Miao, Yanjiao Chen, Jiangyi Deng, Yushi Cheng, Wenyuan Xu

TL;DR

Rennervate defends LLM-integrated applications against Indirect Prompt Injection (IPI) attacks by detecting injections at the token level from attention features, using a 2-step attentive pooling mechanism that aggregates attention heads and response tokens. It couples this detector with an injection identifier and a sanitization module that removes injected tokens while preserving the application's benign functionality. The authors evaluate Rennervate on five LLMs and six datasets, including the new fine-grained FIPI dataset, and report that it outperforms 15 commercial and academic IPI defenses, transfers to unseen attacks, and remains robust against adaptive adversaries. The work also provides extensive ablations and hyperparameter analyses, and discusses practical deployment considerations and limitations. Overall, Rennervate offers a compact, non-intrusive defense for real-world LLM-integrated systems, with open questions around recovery from injected-task-specific content and multi-modal threats.

Abstract

Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform more sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework to detect and prevent IPI attacks. Rennervate leverages attention features to detect the covert injection at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while maintaining LLM functionalities. Specifically, the token-level detector is materialized with a 2-step attentive pooling mechanism, which aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we establish a fine-grained IPI dataset, FIPI, to be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision on 5 LLMs and 6 datasets. We also demonstrate that Rennervate is transferable to unseen attacks and robust against adaptive adversaries.
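
To make the idea of 2-step attentive pooling concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the tensor layout `attn[h][r][t]`, the pooling order (response tokens first, then heads), and the scoring function `score_fn` (a stand-in for whatever learned scorer the paper uses) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(items, score_fn):
    """Weighted average of equal-length vectors.

    Each vector gets an importance score from score_fn; the softmax of
    those scores gives the averaging weights.
    """
    weights = softmax([score_fn(v) for v in items])
    dim = len(items[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, items)) for d in range(dim)]
    return pooled, weights


def two_step_pool(attn, score_fn):
    """attn[h][r][t]: attention from response token r to input token t on head h.

    Step 1 pools over response tokens within each head (assumed order).
    Step 2 pools the per-head results over heads.
    Returns one feature per input token plus the head weights.
    """
    per_head = [attentive_pool(head, score_fn)[0] for head in attn]
    return attentive_pool(per_head, score_fn)


# Toy input: 2 heads, 2 response tokens, 3 input tokens.
attn = [
    [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]],  # head 0 concentrates on input token 2
    [[0.3, 0.4, 0.3], [0.4, 0.3, 0.3]],  # head 1 is diffuse
]
# Hypothetical scorer: peak attention in the vector (not a learned model).
features, head_weights = two_step_pool(attn, score_fn=max)
```

In this toy case the head that concentrates on a single input token receives the larger weight, so the pooled per-token features peak at that token. A token-level detector would consume such features; the residual blocks and training loss mentioned in the paper are not modeled here.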

Paper Structure

This paper contains 46 sections, 5 equations, 8 figures, 13 tables, and 1 algorithm.

Figures (8)

  • Figure 1: A toy example illustrating an IPI attack and our proposed defense method: (a) An adversary injects adversarial instructions into external data sources to goal-hijack the LLM-integrated application. (b) Rennervate detects whether the retrieved data has been compromised by an IPI attack, and sanitizes the injections to maintain the benign functionality of the LLM-integrated application.
  • Figure 2: Design of Rennervate. Rennervate leverages attention features during the inference phase of LLMs. The token-level detector ➀ identifies suspicious tokens and outputs the corresponding logits. The injection identifier ➁ filters these logits and determines whether the textual data $\mathbf{X}$ has been injected. Additionally, the injection sanitizer ➂ sanitizes $\mathbf{X}$ to mitigate the IPI attack, maintaining the benign functionality of the LLM-integrated application.
  • Figure 3: The design of the Token-Level Detector. We utilize the 2-step attentive pooling to automatically aggregate response tokens and attention heads based on their importance. Then $N$ residual blocks are applied to further process the feature. We use cross-entropy loss during the training phase.
  • Figure 4: Jaccard similarity ($\uparrow$) between the sanitized data and the clean data, evaluated across different target LLMs and various datasets, including FIPI, MRPC-HSOL (M-H), Jfleg-RTE (J-R), SST2-MRPC (S-M), MRPC-SST2 (M-S), and RTE-Jfleg (R-J). Rennervate effectively sanitizes injections while having minimal impact on the integrity of the original data.
  • Figure 5: Jaccard similarity ($\uparrow$) between the sanitized data and the clean data, evaluated across different target LLMs and various datasets, including FIPI, MRPC-HSOL (M-H), Jfleg-RTE (J-R), SST2-MRPC (S-M), MRPC-SST2 (M-S), and RTE-Jfleg (R-J).
  • ...and 3 more figures
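
The captions for Figures 4 and 5 report Jaccard similarity between sanitized and clean data. As a rough illustration of that metric, the sketch below computes it over sets of lowercase, whitespace-separated words. The tokenization is an assumption; the captions do not say whether the paper compares words, model tokens, or something else, and the function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(sanitized: str, clean: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over word sets.

    Assumes lowercase, whitespace-split words as the unit of comparison.
    """
    a = set(sanitized.lower().split())
    b = set(clean.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


clean = "the movie was great"
injected = "the movie was great ignore previous instructions"
sanitized = "the movie was great"  # ideal sanitizer output for this toy case

score_injected = jaccard_similarity(injected, clean)    # 4 shared / 7 total words
score_sanitized = jaccard_similarity(sanitized, clean)  # identical word sets
```

A higher score means the sanitized text is closer to the original clean data, which is why the captions mark the metric with an upward arrow.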