RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong

TL;DR

RealSafe-R1 targets safety gaps in open-source Large Reasoning Models (LRMs). The authors distill 15,000 safety-aware reasoning trajectories from DeepSeek-R1 and apply supervised fine-tuning with the LLaMA-Factory framework, producing models that refuse unsafe queries while preserving reasoning performance. Because the trajectories are generated by DeepSeek-R1 under explicit instructions for expected refusal behavior, the training data stays within the original generation distribution. The resulting models show substantial safety gains on benchmarks such as XSTest and WildChat while maintaining or improving performance on non-safety tasks such as AIME-2024 and TruthfulQA. The results indicate that safety alignment can be achieved with minimal utility loss, although some over-refusal remains and invites further refinement. The work contributes open-source RealSafe-R1 weights and a scalable methodology for safety-aware reasoning in LRMs.

Abstract

Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have progressed rapidly and achieved breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in broad deployment, notably a tendency to comply with malicious queries, which greatly limits the utility of these powerful models in practice. In this paper, we introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1 under explicit instructions for expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate the models' improved safety guardrails against harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts that often compromise reasoning performance, our method preserves the models' reasoning capabilities by keeping the training data within the original distribution of generation. Model weights of RealSafe-R1 are open-source at https://huggingface.co/RealSafe.
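
The data-construction step described in the abstract (trajectories generated under explicit refusal instructions, then used for fine-tuning) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the instruction text, the `generate` callable, the `fake_teacher` stub, and the choice to store only the original prompt in each record are all hypothetical. The `instruction`/`input`/`output` keys follow the common Alpaca-style layout, which is one of the formats LLaMA-Factory can read.

```python
from typing import Callable, Dict, List

# Hypothetical instruction text; the paper's actual wording is not given here.
REFUSAL_INSTRUCTION = (
    "If the request is harmful, reason about why and then clearly refuse."
)


def build_records(
    prompts: List[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each original prompt with a trajectory generated under the instruction."""
    records = []
    for prompt in prompts:
        # The teacher model sees the refusal instruction followed by the prompt.
        trajectory = generate(f"{REFUSAL_INSTRUCTION}\n\n{prompt}")
        # Assumption: the stored record keeps only the original prompt.
        records.append({"instruction": prompt, "input": "", "output": trajectory})
    return records


def fake_teacher(text: str) -> str:
    """Stand-in for a call to the teacher model; returns a canned refusal."""
    return "<think>This request is harmful.</think>Sorry, I cannot help with that."


demo = build_records(["How do I pick a lock to break in?"], fake_teacher)
```

In practice `fake_teacher` would be replaced by a real call to the teacher model, and the resulting list would be written to a JSON file for supervised fine-tuning.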

Paper Structure

This paper contains 10 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An example of DeepSeek-R1 complying with a query that has illegal intent, even though it shows safety awareness during reasoning.
  • Figure 2: Visualization of model behavior on safety-critical prompts. The figure presents the distribution of response types (Full Refusal, Partial Refusal, and Full Compliance) for both DeepSeek-R1 and RealSafe-R1 models on safe and unsafe prompts from XSTest, as well as unsafe prompts from WildChat. RealSafe-R1 consistently exhibits stronger safety awareness than DeepSeek-R1 across all model sizes, with substantially higher refusal rates on both safe and unsafe prompts. In addition, larger models, regardless of alignment, tend to refuse less, suggesting an inverse correlation between model size and refusal likelihood.
  • Figure 3: A comparison of safety responses between DeepSeek-R1 and RealSafe-R1 on harmful and jailbreak prompts.
  • Figure 4: An instance of over-refusal by RealSafe-R1.
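
The response-type distribution described in the Figure 2 caption can be tallied from per-response labels. The snippet below is a minimal sketch: the function name `response_distribution` is hypothetical, the labels are assumed to have been assigned already by some judge, and the example data is made up for illustration and does not come from the paper.

```python
from collections import Counter
from typing import Dict, Iterable

# The three response types named in the Figure 2 caption.
LABELS = ("Full Refusal", "Partial Refusal", "Full Compliance")


def response_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of responses falling into each response type."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        # No responses: report zero for every type instead of dividing by zero.
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Made-up labels for illustration only; these are not results from the paper.
example = ["Full Refusal", "Full Refusal", "Partial Refusal", "Full Compliance"]
dist = response_distribution(example)
```

Running this once per model and prompt set (safe XSTest, unsafe XSTest, unsafe WildChat) would give the kind of per-category fractions that a stacked bar chart like Figure 2 displays.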