Table of Contents
Fetching ...

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

Peiran Wang, Xiaogeng Liu, Chaowei Xiao

TL;DR

This paper tackles jailbreak vulnerabilities in LLMs by introducing RePD, a retrieval-based prompt decomposition framework that uses one-shot learning to isolate and neutralize harmful components embedded within user prompts. By retrieving a closest jailbreak template from a database and constructing a one-shot example, RePD teaches the model to decouple the input into a harmless template and a query before responding, thereby preserving normal performance on benign prompts. Empirical results on LLaMA-2-7B-Chat and Vicuna-7B-V1.5 show substantial reductions in attack success rate (ASR) and competitive false positive rates, with further improvements achieved via the multi-agent variant RePD-M and prompt randomization. The approach offers an efficient, adaptable defense against template-based jailbreaks with potential for deployment in open-source LLM ecosystems.

Abstract

In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

TL;DR

This paper tackles jailbreak vulnerabilities in LLMs by introducing RePD, a retrieval-based prompt decomposition framework that uses one-shot learning to isolate and neutralize harmful components embedded within user prompts. By retrieving a closest jailbreak template from a database and constructing a one-shot example, RePD teaches the model to decouple the input into a harmless template and a query before responding, thereby preserving normal performance on benign prompts. Empirical results on LLaMA-2-7B-Chat and Vicuna-7B-V1.5 show substantial reductions in attack success rate (ASR) and competitive false positive rates, with further improvements achieved via the multi-agent variant RePD-M and prompt randomization. The approach offers an efficient, adaptable defense against template-based jailbreaks with potential for deployment in open-source LLM ecosystems.

Abstract

In this study, we introduce RePD, an innovative attack Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves integrating the decomposition of the jailbreak prompt into the user's original query into a one-shot learning example to effectively teach the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.

Paper Structure

This paper contains 28 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We propose RePD, a retrieval-based prompt decomposition framework to defend against jailbreak attacks. Each time RePD receives a user prompt, it will retrieve a jailbreak prompt template from a retrieval database which consists of multiple collected jailbreak prompt templates. Then by inserting the decomposition process of decomplishing the jailbreak prompt to the harmful questions into the user prompt, RePD teaches LLM how to decouple the jailbreak prompt according to the retrieval template.
  • Figure 2: We provide an example for RePD.
  • Figure 3: The time cost of RePD and RePD-M with retrieval and non-retrieval.
  • Figure 4: We compare single-agent RePD with multi-agent RePD-M's effectiveness against adaptive attack. The experiment results show that RePD-M outperforms RePD in both ASR and FPR. This indicates that RePD-M has a better defense effectiveness for adaptive attacks.
  • Figure 5: Evaluation of different defense frameworks on different attack methods.
  • ...and 2 more figures