UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

TL;DR

UnsafeChain targets a blind spot in safety alignment for large reasoning models (LRMs): hard prompts that reliably trigger unsafe outputs. It builds a correction-based dataset from six domains, using a strong LLM to rewrite unsafe completions into safe responses with chain-of-thought (CoT) reasoning and verifying their safety with automated checks. Fine-tuning three LRMs with LoRA adapters on UnsafeChain (and on two 1K subsets) yields consistent safety and reasoning gains across six out-of-distribution (OOD) and five in-distribution (ID) benchmarks, often surpassing SafeChain and STAR-1 even with limited data. The work shows that exposing models to failure cases together with corrective safety reasoning improves robustness to adversarial prompts while preserving general reasoning ability.

Abstract

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies have predominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts drawn from diverse sources, in which unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against the recent SafeChain and STAR-1 datasets across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
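
The correction-based construction described above can be illustrated with a minimal sketch. This is not the authors' released code: the names `build_correction_dataset`, `generate`, `is_safe`, `rewrite`, and `max_attempts` are hypothetical, and the sketch samples a single completion per prompt as a simplification of "prompts that always elicit harmful outputs". It assumes the caller supplies a target model, a safety judge, and a stronger rewriting model as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CorrectedExample:
    """One training record: a hard prompt paired with a verified safe response."""
    prompt: str
    unsafe_completion: str
    safe_response: str


def build_correction_dataset(
    prompts: Iterable[str],
    generate: Callable[[str], str],       # hypothetical: target model completion
    is_safe: Callable[[str, str], bool],  # hypothetical: automated safety judge
    rewrite: Callable[[str, str], str],   # hypothetical: strong LLM that corrects
    max_attempts: int = 2,                # assumption: retry budget per prompt
) -> List[CorrectedExample]:
    dataset: List[CorrectedExample] = []
    for prompt in prompts:
        completion = generate(prompt)
        # Keep only hard cases: prompts whose completion is judged unsafe.
        if is_safe(prompt, completion):
            continue
        # Ask the rewriter for a safe correction and verify it before keeping it.
        for _ in range(max_attempts):
            candidate = rewrite(prompt, completion)
            if is_safe(prompt, candidate):
                dataset.append(CorrectedExample(prompt, completion, candidate))
                break
    return dataset
```

In this sketch, prompts whose corrections never pass the safety check are dropped rather than kept with an unverified response, so every stored example pairs an observed failure with a verified safe correction.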

Paper Structure

This paper contains 35 sections, 1 equation, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of the UnsafeChain dataset construction pipeline.