A Critical Evaluation of Defenses against Prompt Injection Attacks
Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
TL;DR
Prompt injection attacks exploit the inseparability of instructions and data in LLM prompts. The paper proposes a principled evaluation framework that jointly assesses a defense's effectiveness against diverse existing and adaptive attacks and its general-purpose utility across broad benchmarks. Through case studies of StruQ, SecAlign, Instruction Hierarchy, PromptGuard, and Attention Tracker, it shows that many defenses are less effective than claimed, particularly under adaptive attacks, and often incur utility losses. The work provides benchmarks, metrics, and open-source data to guide the development of more robust defenses.
Abstract
Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we make the case for assessing defenses along two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, existing defenses turn out to be less successful than previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.
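To make the two evaluation dimensions concrete, the following is a minimal Python sketch of how they might be tallied. It is an illustration under stated assumptions, not the authors' implementation or the PIEval API: the names `AttackTrial`, `attack_success_rate`, and `utility_change`, the attack labels, and the benchmark keys are all hypothetical, and the numbers in the usage note are made up for demonstration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AttackTrial:
    """One (target prompt, injected prompt) test case; hypothetical schema."""
    attack: str       # illustrative label, e.g. "existing" or "adaptive"
    succeeded: bool   # True if the model carried out the injected task


def attack_success_rate(trials):
    """Effectiveness dimension: fraction of successful injections per attack label."""
    by_attack = {}
    for trial in trials:
        by_attack.setdefault(trial.attack, []).append(1.0 if trial.succeeded else 0.0)
    return {name: mean(outcomes) for name, outcomes in by_attack.items()}


def utility_change(undefended_scores, defended_scores):
    """Utility dimension: per-benchmark score change (defended minus undefended)."""
    return {
        bench: defended_scores[bench] - undefended_scores[bench]
        for bench in undefended_scores
    }
```

For example, with four made-up trials where the defense blocks both existing attacks but only one of two adaptive ones, `attack_success_rate` would report a rate of 0.0 for the first group and 0.5 for the second, while `utility_change` would report a negative value for any benchmark on which the defended model scores lower than the undefended one. Reporting both quantities side by side reflects the abstract's point that a defense should be judged on effectiveness and utility together.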
