Attention Tracker: Detecting Prompt Injection Attacks in LLMs
Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen
TL;DR
Prompt injection attacks compromise LLMs by diverting the model's attention so that it ignores its original instructions. The authors identify a distraction effect, in which certain attention heads shift focus from the original instruction to the injected prompt, and they propose Attention Tracker, a training-free detector that uses pre-identified important heads to measure how much focus remains on the instruction. The method requires no data or training and generalizes well across models, datasets, and attack types, achieving substantial AUROC gains over prior methods, including training-free baselines, even on small LLMs. This work advances practical defenses for LLM-integrated systems and provides insight into how attention behaves internally during prompt injection.
Abstract
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring its original instructions and executing a designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
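As a rough illustration of the idea described in the abstract, the sketch below computes a score for how much attention a set of important heads places on the instruction tokens, then flags an input when that score is low. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `focus_score` and `is_injection`, the choice of using the last token's attention row, the averaging over heads, and the threshold of 0.5 are all hypothetical choices made for this example. The attention weights are toy numbers rather than output from a real model.

```python
from typing import Dict, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical naming


def focus_score(
    last_token_attention: Dict[Head, Sequence[float]],
    important_heads: Sequence[Head],
    instruction_positions: Sequence[int],
) -> float:
    """Average, over the important heads, of the attention mass that the
    last token places on the instruction tokens (an assumed formulation)."""
    if not important_heads:
        raise ValueError("need at least one important head")
    total = 0.0
    for head in important_heads:
        weights = last_token_attention[head]
        # Sum the attention weights that land on instruction positions.
        total += sum(weights[i] for i in instruction_positions)
    return total / len(important_heads)


def is_injection(score: float, threshold: float = 0.5) -> bool:
    """Flag the input when focus on the instruction drops below a threshold.
    The threshold value is an assumption chosen for this toy example."""
    return score < threshold


# Toy sequence of 4 tokens: positions 0-1 are the instruction, 2-3 are data.
heads = [(0, 0), (1, 2)]
instruction = [0, 1]

# Benign input: important heads mostly attend to the instruction.
benign = {(0, 0): [0.4, 0.4, 0.1, 0.1], (1, 2): [0.3, 0.3, 0.2, 0.2]}
# Attacked input: attention is drawn toward the injected text in the data.
attacked = {(0, 0): [0.1, 0.1, 0.4, 0.4], (1, 2): [0.05, 0.15, 0.4, 0.4]}

benign_score = focus_score(benign, heads, instruction)
attacked_score = focus_score(attacked, heads, instruction)
print(is_injection(benign_score), is_injection(attacked_score))
```

In a real system the per-head attention rows would come from the model's own forward pass on the prompt, which is consistent with the abstract's point that no additional LLM inference is needed. How the important heads are selected and how the threshold is set are not specified in this section, so both are left as inputs here.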
