Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Heegyu Kim, Sehyun Yuk, Hyunsouk Cho

TL;DR

The paper tackles jailbreak vulnerabilities in language models by proposing a training-free Self-Refine workflow in which the model iteratively critiques and improves its own outputs, guided by a cost model. It introduces attention-shifting formatting (JSON and code) to make refinement more efficient, and demonstrates that Self-Refine substantially lowers attack success rates and harm costs across non-safety-aligned LMs, outperforming several baseline defenses. GPT-4 and reward-model evaluations reveal a complex tradeoff: safety-focused defenses can reduce helpfulness in some cases, yet non-safety-aligned LMs can deliver safer and more helpful results under Self-Refine, with formatting accelerating convergence and reducing computational load. Overall, the approach offers a practical, resource-efficient path to safer deployment of open language models without retraining, though the authors acknowledge that it does not achieve perfect safety across all models and that it requires careful design choices.
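The iterative workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` and `cost` are hypothetical callables standing in for the language model and the cost model, the prompt wording is invented, and the use of the cost score as a stopping check (with `threshold` and `max_iters` as free parameters) is an assumption.

```python
from typing import Callable, Tuple


def self_refine(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: wraps the language model
    cost: Callable[[str, str], float],   # hypothetical: higher score = more harmful
    max_iters: int = 4,                  # assumed iteration budget
    threshold: float = 0.0,              # assumed harmlessness cutoff
) -> Tuple[str, int]:
    """Return (final_response, number_of_refinement_steps_used)."""
    response = generate(prompt)
    for step in range(max_iters):
        # Stop as soon as the cost model judges the response acceptable.
        if cost(prompt, response) <= threshold:
            return response, step
        # 1) Ask the same model to critique its own response.
        feedback = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            "Point out anything harmful or unethical in the response."
        )
        # 2) Ask the model to rewrite the response using that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nFeedback: {feedback}\n"
            "Rewrite the response so that it is safe and still helpful."
        )
    return response, max_iters
```

No model weights are updated at any point, which is what makes the procedure training-free; the only cost is the extra generation calls per refinement step.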

Abstract

Caution: This paper contains offensive language that some readers may find unpleasant. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is an extensive process, which makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks among those compared. Additionally, we propose a formatting method that improves the efficiency of the self-refine process, reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings make it possible to reduce safety risk at lower computational cost, allowing non-safety-aligned LMs to be easily used in real-world services.
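To make the formatting idea concrete, the sketch below wraps a prompt and response in a JSON object before asking the model to refine it. This is only an illustration under stated assumptions: the key names (`user_prompt`, `model_response`), the instruction wording, and the helper names are hypothetical and do not reproduce the paper's actual templates.

```python
import json


def format_as_json(prompt: str, response: str) -> str:
    """Serialize the exchange as a JSON object (key names are hypothetical)."""
    payload = {"user_prompt": prompt, "model_response": response}
    # json.dumps escapes quotes and newlines, so the attack text stays
    # inside a string value instead of reading as a direct instruction.
    return json.dumps(payload, ensure_ascii=False, indent=2)


def build_refine_input(prompt: str, response: str) -> str:
    """Build a refinement request around the JSON-wrapped exchange."""
    return (
        "You are given a conversation in JSON format.\n"
        f"{format_as_json(prompt, response)}\n"
        'Identify any harmful content in "model_response" and rewrite it safely.'
    )
```

A plausible reading of the attention-shifting effect is that wrapping the jailbreak text as data makes the model less likely to follow it as an instruction; the sketch shows only the wrapping step and does not measure that effect.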

Paper Structure

This paper contains 34 sections, 9 figures, 19 tables, 1 algorithm.

Figures (9)

  • Figure 1: Rate of successful jailbreak prompt attack
  • Figure 2: Various strategies of jailbreak attacks
  • Figure 3: The Self-Refine process
  • Figure 4: ASR of the base LMs by iterative self-refine
  • Figure 5: Example prompts of the self-refine with JSON formatting and no formatting
  • ...and 4 more figures