Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Chentao Cao, Xiaojun Xu, Bo Han, Hang Li

TL;DR

This paper introduces Answer-Then-Check, a safety alignment approach that makes LLMs more robust to malicious prompts: the model first answers the question in its thoughts, then evaluates the safety of that answer before producing a final response to the user.

Abstract

As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying the model's thinking ability to mitigate jailbreaks before it produces a final answer to the user. Our method enables models to answer the question directly in their thoughts and then critically evaluate the safety of that answer before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach reaches the Pareto frontier, with superior safety capability and lower over-refusal rates. Notably, the fine-tuned model maintains general reasoning capabilities on benchmarks such as MMLU, MATH500, and HumanEval. In addition, our method equips models with the ability to perform safe completion, whereas post-hoc detection methods can only directly reject sensitive, harmful queries (e.g., self-harm). Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training. We also find that even 500 samples can yield performance comparable to the entire dataset, suggesting a promising path for data-efficient safety alignment. The dataset is publicly available at: https://huggingface.co/datasets/ByteDance-Seed/ReSA.

Paper Structure

This paper contains 86 sections, 1 equation, 26 figures, 19 tables, and 1 algorithm.

Figures (26)

  • Figure 1: Comparison of jailbreak defense between standard aligned models (top) and our ReSA-SFT/RL model with the "Answer-Then-Check" strategy (bottom). Whereas conventional aligned models remain vulnerable to jailbreak attempts, ReSA-SFT/RL strengthens defense by first generating an intended answer summary and then performing a safety analysis before the final response.
  • Figure 2: The ReSA dataset curation pipeline, which consists of three main stages: safety query collection, answer summary generation, and safety analysis synthesis.
  • Figure 3: The Answer-Then-Check reasoning template. The template structures the reasoning process into three parts: intended answer summary, safety analysis, and final response based on analysis.
  • Figure 4: Performance with varying ReSA training set sizes, where the left panel shows average safety against jailbreaks, the middle shows over-refusal accuracy, and the right shows general reasoning capabilities averaged over MATH500, HumanEval, and MMLU.
  • Figure 5: Representative examples of the four query categories used in the ReSA dataset: vanilla benign, adversarial benign, vanilla harmful, and adversarial harmful queries.
  • ...and 21 more figures
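
The three-part structure described for Figure 3 (intended answer summary, safety analysis, final response) can be illustrated with a small parser. This is only a sketch under stated assumptions: the tag names `intended_answer_summary`, `safety_analysis`, and `final_response`, and the helper names `parse_output` and `user_visible`, are hypothetical and chosen for illustration; the actual markers used in the ReSA template may differ.

```python
import re
from dataclasses import dataclass

# Hypothetical section tags; the paper's actual template markers may differ.
SUMMARY_TAG = "intended_answer_summary"
ANALYSIS_TAG = "safety_analysis"
FINAL_TAG = "final_response"


@dataclass
class AnswerThenCheck:
    summary: str   # what the model would answer, drafted in its thoughts
    analysis: str  # the model's safety assessment of that draft
    final: str     # the response actually shown to the user


def _extract(tag: str, text: str) -> str:
    """Return the stripped content of <tag>...</tag>, or raise if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"missing <{tag}> section")
    return match.group(1).strip()


def parse_output(text: str) -> AnswerThenCheck:
    """Split a model output into the three template parts."""
    return AnswerThenCheck(
        summary=_extract(SUMMARY_TAG, text),
        analysis=_extract(ANALYSIS_TAG, text),
        final=_extract(FINAL_TAG, text),
    )


def user_visible(text: str) -> str:
    """Return only the final response; summary and analysis stay hidden."""
    return parse_output(text).final


example = (
    "<intended_answer_summary>Steps to reset a router password."
    "</intended_answer_summary>\n"
    "<safety_analysis>The request is benign; answering is safe."
    "</safety_analysis>\n"
    "<final_response>Hold the reset button for ten seconds.</final_response>"
)
```

Calling `user_visible(example)` returns only the text of the final-response section, mirroring the idea that the drafted answer and its safety analysis precede the response the user sees.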