Attention Tracker: Detecting Prompt Injection Attacks in LLMs

Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen

TL;DR

Prompt injection attacks compromise LLMs by diverting the model's attention away from the original instruction. The authors identify a distraction effect, in which certain attention heads shift focus from the original instruction to the injected one, and they propose Attention Tracker, a training-free detector that uses pre-identified important heads to measure how much attention remains on the instruction. The method requires no training and no additional LLM inference, and it generalizes across models, datasets, and attack types, improving AUROC by up to 10.0% over existing methods, including training-free baselines, even on small LLMs. The work offers a practical defense for LLM-integrated systems and insight into how attention behaves internally during prompt injection.

Abstract

Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated actions. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
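
The detection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the attention weights of the last token have already been extracted into a NumPy array, and the names `focus_score`, `is_injection`, the toy attention rows, and the `threshold` value of 0.5 are all hypothetical choices made for illustration.

```python
import numpy as np

def focus_score(attn, instruction_slice, important_heads):
    """Mean attention mass from the last token to the instruction.

    attn: array of shape (layers, heads, seq_len) holding the last token's
          attention weights over every prompt position.
    instruction_slice: slice covering the instruction tokens.
    important_heads: list of (layer, head) pairs, assumed to be precomputed.
    """
    # Sum attention over the instruction positions for every head.
    per_head = attn[:, :, instruction_slice].sum(axis=-1)  # (layers, heads)
    layers, heads = zip(*important_heads)
    # Average only over the selected heads.
    return float(per_head[list(layers), list(heads)].mean())

def is_injection(attn, instruction_slice, important_heads, threshold=0.5):
    """Flag the input when attention on the instruction falls below threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return focus_score(attn, instruction_slice, important_heads) < threshold

# Toy data: 2 layers, 2 heads, 6 prompt tokens; the first 3 are the instruction.
normal_row = [0.30, 0.30, 0.30, 0.05, 0.03, 0.02]  # attention stays on the instruction
attack_row = [0.05, 0.05, 0.05, 0.25, 0.30, 0.30]  # attention drifts to later tokens
normal = np.tile(normal_row, (2, 2, 1))
attack = np.tile(attack_row, (2, 2, 1))

instruction = slice(0, 3)
heads = [(0, 1), (1, 0)]  # hypothetical important heads

print(focus_score(normal, instruction, heads))
print(focus_score(attack, instruction, heads))
print(is_injection(attack, instruction, heads))
```

In this toy setup the normal input keeps most of its attention mass on the instruction and is not flagged, while the attack input falls below the threshold. Because the score is read from attention weights the model already computes, this kind of check is consistent with the claim that no additional LLM inference is needed.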

Paper Structure

This paper contains 32 sections, 5 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Attention Tracker: This figure illustrates the detection pipeline of Attention Tracker and highlights the distraction effect caused by prompt injection attacks. For normal data, the attention of the last token typically focuses on the original instruction. However, when dealing with attack data, which often includes a separator and an injected instruction (e.g., print "hacked"), the attention shifts from the original instruction to the injected instruction. By leveraging this distraction effect, Attention Tracker tracks the total attention score from the last token to the instruction prompt within important heads to detect prompt injection attacks.
  • Figure 2: Distraction Effect of Prompt Injection Attack: (a) Attention scores summed from the last token to the instruction prompt across different layers and heads. (b) Attention scores from the last token to tokens in the prompt across different layers. The figures show that for normal data, specific heads assign significantly higher attention scores to the instruction prompt than in attack cases. During an attack, attention shifts from the original instruction to the injected instruction, illustrating the distraction effect.
  • Figure 3: Distraction Effect of Different Attack Strategies: This figure shows the distribution of the aggregated $Attn^{l,h}(I)$ across all layers and heads for different attacks on a subset of the Open-Prompt-Injection dataset [liu2024formalizing]. The legend indicates the color representing each attack strategy and the corresponding attack success rate (in parentheses).
  • Figure 4: Qualitative Analysis: The figure presents a qualitative analysis of the important heads' aggregated attention distribution across tokens, for normal data and attack data respectively.
  • Figure 5: Heads Generalization: The figure illustrates the mean difference in $Attn^{l,h}(I)$ scores between normal data and attack data from the deepset prompt injection dataset [huggingfaceDeepsetpromptinjectionsDatasets], the Open-Prompt-Injection benchmark [liu2024formalizing], and the set of LLM-generated data we used to find important heads.
  • ...and 3 more figures
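
The Figure 5 caption mentions a mean difference in per-head scores between normal and attack data, computed on the data used to find important heads. The sketch below shows one way such a difference could drive head selection. The exact selection rule is not stated in this summary, so the function name `select_important_heads`, the fixed `min_gap` cutoff, and the toy scores are hypothetical.

```python
import numpy as np

def select_important_heads(normal_scores, attack_scores, min_gap=0.2):
    """Pick heads whose attention on the instruction drops most under attack.

    normal_scores, attack_scores: arrays of shape (samples, layers, heads),
        each entry being one head's attention mass on the instruction.
    min_gap: hypothetical cutoff on the mean difference (not from the paper).
    Returns a list of (layer, head) pairs in row-major order.
    """
    # Mean over samples, then the per-head difference between the two conditions.
    gap = normal_scores.mean(axis=0) - attack_scores.mean(axis=0)
    layers, heads = np.nonzero(gap >= min_gap)
    return [(int(l), int(h)) for l, h in zip(layers, heads)]

# Toy scores: 2 samples, 2 layers, 2 heads.
normal_scores = np.array([[[0.9, 0.2], [0.5, 0.8]],
                          [[0.7, 0.2], [0.5, 0.6]]])
attack_scores = np.array([[[0.2, 0.2], [0.4, 0.1]],
                          [[0.2, 0.2], [0.4, 0.3]]])

print(select_important_heads(normal_scores, attack_scores))
```

Here heads (0, 0) and (1, 1) show a large drop under attack and are selected, while the other two barely change. A fixed cutoff is the simplest possible rule; a criterion that also accounts for per-head variance would be a natural alternative.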

Theorems & Definitions (1)

  • Definition 1