Table of Contents
Fetching ...

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan

TL;DR

This work addresses the vulnerability of instruction-following LLMs to prompt injection by proposing an open-book QA benchmark that injects adversarial instructions into web-context results. It introduces performance-influence (PI) and instruction-discrimination rate (IDR) as core metrics, along with a derived robustness measure (PDR), and evaluates eight leading LLMs (proprietary and open-source) across four QA datasets with both context-relevant and context-irrelevant injections. The results reveal a pronounced robustness gap: strong instruction-following capability does not guarantee resilience to injected prompts, and even robust models can be compromised by specific phrases, highlighting the need to emphasize prompt-context comprehension and instruction discrimination in future work. The paper provides actionable insights into attack and defense mechanisms, showing that simple defenses help but are not universally effective, and offers a publicly available codebase to facilitate future research in this area.

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine the extent to which LLMs can be influenced by injected instructions and their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our results indicate that some models are overly tuned to follow any embedded instructions in the prompt, overly focusing on the latter parts of the prompt without fully grasping the entire context. By contrast, models with a better grasp of the context and instruction-following capabilities will potentially be more susceptible to compromise by injected instructions. This underscores the need to shift the focus from merely enhancing LLMs' instruction-following capabilities to improving their overall comprehension of prompts and discernment of instructions that are appropriate to follow. We hope our in-depth analysis offers insights into the underlying causes of these vulnerabilities, aiding in the development of future solutions. Code and data are available at https://github.com/Leezekun/instruction-following-robustness-eval

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

TL;DR

This work addresses the vulnerability of instruction-following LLMs to prompt injection by proposing an open-book QA benchmark that injects adversarial instructions into web-context results. It introduces performance-influence (PI) and instruction-discrimination rate (IDR) as core metrics, along with a derived robustness measure (PDR), and evaluates eight leading LLMs (proprietary and open-source) across four QA datasets with both context-relevant and context-irrelevant injections. The results reveal a pronounced robustness gap: strong instruction-following capability does not guarantee resilience to injected prompts, and even robust models can be compromised by specific phrases, highlighting the need to emphasize prompt-context comprehension and instruction discrimination in future work. The paper provides actionable insights into attack and defense mechanisms, showing that simple defenses help but are not universally effective, and offers a publicly available codebase to facilitate future research in this area.

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine the extent to which LLMs can be influenced by injected instructions and their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our results indicate that some models are overly tuned to follow any embedded instructions in the prompt, overly focusing on the latter parts of the prompt without fully grasping the entire context. By contrast, models with a better grasp of the context and instruction-following capabilities will potentially be more susceptible to compromise by injected instructions. This underscores the need to shift the focus from merely enhancing LLMs' instruction-following capabilities to improving their overall comprehension of prompts and discernment of instructions that are appropriate to follow. We hope our in-depth analysis offers insights into the underlying causes of these vulnerabilities, aiding in the development of future solutions. Code and data are available at https://github.com/Leezekun/instruction-following-robustness-eval
Paper Structure (33 sections, 5 equations, 7 figures, 2 tables)

This paper contains 33 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Example of our evaluation setup. The LLM is tasked with answering the user question (highlighted in green) using web search results that have been pre-injected with an adversarial question (highlighted in red). Although the LLM could initially generate the correct answer, it might be misled by the injected adversarial question.
  • Figure 2: Quantitative assessment of PDR and IDR metrics across four benchmark datasets. The exact mapping of model identifiers M1-M8 to their respective model names is provided in Table \ref{['tab:models']}.
  • Figure 3: Quantitative evaluation of PDR ($\downarrow$) against injections of context-irrelevant and relevant instructions.
  • Figure 4: Investigation of the effects of instruction injection position on performance. Higher PDR and lower IDR indicate decreased robustness.
  • Figure 5: Investigation of effects of order, attack, and defense strategies. The term "attack" denotes the addition of prefixes to injected instructions, as detailed in Section \ref{['sect:attack-defense']}.
  • ...and 2 more figures