Table of Contents
Fetching ...

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi

TL;DR

Backdoor-powered prompt injection attacks are shown to be more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.

Abstract

With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

TL;DR

Backdoor-powered prompt injection attacks are shown to be more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.

Abstract

With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.

Paper Structure

This paper contains 51 sections, 8 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: (a) is an example of a regular prompt injection attack. The text in orange is an injected instruction. (b) is an example of the backdoor-powered prompt injection attack. The text in red is the trigger. The backdoored LLM has been trained such that the trigger induces it to only execute the injected instruction within the trigger region.
  • Figure 2: An illustration of the backdoor instruction tuning process. The clean input is a standard instruction and the corresponding response is the clean output. In contrast, the poisoned input includes the clean instruction along with the injected instruction, which is surrounded by the trigger. The poisoned output is the response to the injected instruction. The trigger "[hello], [hello]" is randomly selected and can be adjusted based on the attack scenarios.
  • Figure 3: The ablation study of backdoor poison rate. The evaluation metrics is the ASR and all the results are reported in %. "StruQ" means the backdoored model is post-trained with StruQ defense method.
  • Figure 4: Backdoor and clean model performance on the MMLU dataset. The evaluation metrics is the accuracy, and all the results are reported in %.
  • Figure 5: An illustration of two instruction hierarchy fine-tuning defense methods StruQ (a) and SecAlign (b). The training input of the two defense methods are similar. For StruQ, the training target is the response to the original input instruction. For SecAlign, they utilize DPO for fine-tuning. The chosen target is the response to the original input instruction and the rejected target is the response to the injected instruction.
  • ...and 2 more figures