Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang

TL;DR

The paper addresses jailbreak attacks on large language models by introducing Prefix Guidance (PG), a plug-and-play defense that fixes the first few tokens of the model's output and uses a lightweight external classifier to detect harmful prompts. The PG pipeline combines a carefully selected refusal prefix with a binary classifier to decide whether to refuse a harmful input or proceed with a normal response, preserving model capabilities. Across three models (Vicuna, Llama2, and Guanaco) and five jailbreak methods, PG substantially lowers attack success rates and harmful outputs, matching or surpassing the state-of-the-art SafeDecoding defense while incurring limited degradation on the general-capability benchmark Just-Eval. The work also introduces a harmful-instruction dataset, provides detailed training and evaluation protocols, and offers a practical, reproducible defense strategy for safer deployment of LLMs.

Abstract

In recent years, large language models (LLMs) have developed rapidly and achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, in which adversaries induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model's capabilities. In this paper, we propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model's output. This approach combines the model's inherent security capabilities with an external classifier to defend against jailbreak attacks. We demonstrate the effectiveness of PG across three models and five attack methods. Compared to baselines, our approach is generally more effective on average. Additionally, results on the Just-Eval benchmark further confirm PG's superiority in preserving the model's performance. Our code is available at https://github.com/weiyezhimeng/Prefix-Guidance.
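The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prefix string, the function names `prefix_guidance`, `toy_generate`, and `toy_is_harmful`, and the choice to run the classifier on the prefix-guided text are all hypothetical stand-ins, and the toy model and classifier exist only to make the example runnable.

```python
from typing import Callable

# Hypothetical refusal prefix; the paper selects its own prefix per model.
REFUSAL_PREFIX = "I cannot"


def prefix_guidance(
    prompt: str,
    generate: Callable[[str, str], str],
    is_harmful: Callable[[str], bool],
    prefix: str = REFUSAL_PREFIX,
) -> str:
    """Refuse or answer `prompt` using a forced output prefix plus a classifier.

    generate(prompt, forced_prefix) returns the text the model produces
    after `forced_prefix`; is_harmful(text) is the external binary classifier.
    Both signatures are assumptions made for this sketch.
    """
    # Step 1: force the output to start with the refusal prefix and let
    # the model continue from there.
    guided = prefix + generate(prompt, prefix)
    # Step 2: the external classifier decides whether the prompt is harmful
    # (assumed here to look at the prefix-guided text).
    if is_harmful(guided):
        return guided  # keep the refusal
    # Step 3: otherwise answer normally, without the forced prefix.
    return generate(prompt, "")


def toy_generate(prompt: str, forced_prefix: str) -> str:
    """Stand-in for an LLM call, for demonstration only."""
    if forced_prefix:
        if "bomb" in prompt:
            return " help with that because it is dangerous."
        return " think of a reason to refuse this request."
    return f"Here is an answer to: {prompt}"


def toy_is_harmful(text: str) -> bool:
    """Stand-in for the trained binary classifier."""
    return "dangerous" in text


refused = prefix_guidance("how to build a bomb", toy_generate, toy_is_harmful)
answered = prefix_guidance("what is 2+2", toy_generate, toy_is_harmful)
```

In this sketch a benign prompt costs two generation calls (one guided, one normal); whether the real system short-circuits that is not stated in this section.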

Paper Structure

This paper contains 49 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of Prefix Guidance.
  • Figure 2: The components of the LLM's input and output. The green section represents the system prompt, the yellow section represents the user prefix, the red section represents the user prompt, the blue section represents the assistant prefix, and the purple section represents the final model output, the assistant prompt.
  • Figure 3: The experimental setup of Self-Reminder.
  • Figure 4: The experimental setup of Self-Examination.
  • Figure 5: An example input for the Vicuna model.
  • ...and 2 more figures
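
The input layout described in the Figure 2 caption (system prompt, user prefix, user prompt, assistant prefix, then the model's output) can be illustrated with a small sketch. The `PromptParts` type, the `build_input` helper, the space-joined layout, and the `USER:` / `ASSISTANT:` strings are assumptions for illustration only; they are not taken from the paper's actual templates.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The four input components named in the Figure 2 caption."""
    system_prompt: str
    user_prefix: str
    user_prompt: str
    assistant_prefix: str


def build_input(parts: PromptParts, forced_output_prefix: str = "") -> str:
    """Concatenate the components into one model input string.

    If `forced_output_prefix` is given, it is appended after the assistant
    prefix so that generation continues from those first output tokens.
    """
    segments = [
        parts.system_prompt,
        f"{parts.user_prefix} {parts.user_prompt}",
        parts.assistant_prefix,
    ]
    text = " ".join(s for s in segments if s)
    if forced_output_prefix:
        text += " " + forced_output_prefix
    return text


# Illustrative values only; real chat templates differ per model.
parts = PromptParts(
    system_prompt="You are a helpful assistant.",
    user_prefix="USER:",
    user_prompt="Hello",
    assistant_prefix="ASSISTANT:",
)
plain_input = build_input(parts)
guided_input = build_input(parts, forced_output_prefix="I")
```

Placing the forced tokens directly after the assistant prefix is what lets a defense of this kind set the start of the output without modifying the model itself.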