Self-Guard: Empower the LLM to Safeguard Itself

Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong

TL;DR

Self-Guard addresses jailbreak attacks on LLMs with a two-stage framework that combines the strengths of safety training and safeguards. In the first stage, the model's ability to assess harmful content is enhanced; in the second stage, the model is instructed to consistently run harmful content detection on its own responses. Experiments indicate that Self-Guard is robust against jailbreak attacks and does not degrade general model performance. The bad case analysis finds that the LLM occasionally provides harmless responses to harmful queries. In sensitivity tests, Self-Guard avoids inducing over-sensitivity and can even mitigate it. Overall, Self-Guard offers a practical path to safer LLM behavior without sacrificing capability, by enabling the model to monitor and regulate its own outputs.

Abstract

Jailbreak attacks can bypass the safety measures of a Large Language Model (LLM) and cause it to generate harmful content. Such misuse of LLMs has led to negative societal consequences. Currently, there are two main approaches to addressing jailbreak attacks: safety training and safeguards. Safety training further trains the LLM to enhance its safety, whereas safeguards rely on external models or filters to prevent harmful outputs. However, safety training is limited in its ability to adapt to new attack types and often leads to a drop in model performance, and safeguards have proven to be of limited help. To tackle these issues, we propose a novel approach called Self-Guard, which combines the strengths of both safety methods. Self-Guard includes two stages. In the first stage, we enhance the model's ability to assess harmful content, and in the second stage, we instruct the model to consistently perform harmful content detection on its own responses. Our experiments demonstrate that Self-Guard is robust against jailbreak attacks. In the bad case analysis, we find that the LLM occasionally provides harmless responses to harmful queries. Additionally, we evaluate the general capabilities of the LLM before and after safety training, providing evidence that Self-Guard does not degrade the LLM's performance. In sensitivity tests, Self-Guard not only avoids inducing over-sensitivity in the LLM but can even mitigate this issue.
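
The second stage, in which the model checks its own responses, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the model has been instructed to end each response with a verdict tag, and the tag strings `[harmful]` and `[harmless]`, the refusal message, and the function names `split_tag` and `guarded_reply` are all hypothetical. The `generate` argument stands in for any call to the model.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical refusal text; the source does not specify one.
REFUSAL = "Sorry, I cannot help with that request."

# Assumed convention: the model's self-assessment is the last token
# of the response, written as "[harmful]" or "[harmless]".
_TAG_PATTERN = re.compile(r"\s*\[(harmful|harmless)\]\s*$")


def split_tag(response: str) -> Tuple[str, Optional[str]]:
    """Separate the response body from its trailing self-assessment tag.

    Returns (body, tag), where tag is "harmful", "harmless", or None
    when the response carries no tag at all.
    """
    match = _TAG_PATTERN.search(response)
    if match is None:
        return response.strip(), None
    return response[: match.start()].strip(), match.group(1)


def guarded_reply(generate: Callable[[str], str], prompt: str) -> str:
    """Return the model's answer only if the model tagged it as harmless."""
    body, tag = split_tag(generate(prompt))
    if tag == "harmless":
        return body
    # Design choice of this sketch: a harmful or missing tag is refused
    # (fail closed), so an untagged response is never shown to the user.
    return REFUSAL


if __name__ == "__main__":
    # A stub in place of a real model call.
    def fake_model(prompt: str) -> str:
        return "Paris is the capital of France. [harmless]"

    print(guarded_reply(fake_model, "What is the capital of France?"))
```

In this sketch the filter is deliberately trivial: the harmfulness judgment comes from the model itself, and the surrounding code only reads the tag. Failing closed on untagged output is a choice made here for illustration and may not match the paper's behavior.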

Paper Structure (22 sections)

  1. For every submission
  2. Did you discuss the limitations of your work?
  3. Did you discuss any potential risks of your work?
  4. Do the abstract and introduction summarize the paper’s main claims?
  5. Did you use or create scientific artifacts?
  6. Did you cite the creators of artifacts you used?
  7. Did you discuss the license or terms for use and/or distribution of any artifacts?
  8. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
  9. Did you discuss the steps taken to check whether the data that was collected/used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
  10. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
  11. Did you report relevant statistics like the number of examples, details of train/test/dev splits, etc. for the data that you used/created?
  12. Did you run computational experiments?
  13. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
  14. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
  15. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
  16. ...and 7 more sections