Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

Jinghan Jia, Nathalie Baracaldo, Sijia Liu

TL;DR

This work addresses safety challenges in large reasoning models (LRMs) that reveal unsafe reasoning within chain-of-thought traces. It critically analyzes the limitations of supervised fine-tuning (SFT) for safety, showing weak cross-model generalization, dataset sensitivity, and potential harm to reasoning. The authors propose a reinforcement learning–based safety alignment framework that directly optimizes safety-rewarded policies and preserves reasoning performance, demonstrating stronger and more consistent safety gains across multiple model families and benchmarks. Fine-grained analyses of reflection dynamics and token-level entropy support the mechanism: RL suppresses unsafe exploratory reasoning while maintaining reflective depth, yielding safer and more reliable LRMs with practical implications for scalable safety deployment.

Abstract

Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.

Paper Structure

This paper contains 31 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Safety performance on the AttaQ benchmark, comparing the LRM DeepSeek-R1-Distill-Qwen-7B with the standard instruction-tuned model Qwen2.5-7B-Instruct across multiple safety categories. Higher scores indicate better safety. The LRM consistently lags behind the LLM of the same scale, revealing a pronounced safety gap.
  • Figure 2: Safety performance on AttaQ for different response components of DeepSeek-R1-Distill-Qwen-7B (response $\mathbf{y}$ and whole response $\mathbf{t}+\mathbf{y}$), compared with Qwen2.5-7B-Instruct. Higher scores indicate better safety.
  • Figure 3: Safety performance of Qwen3-8B on the AttaQ benchmark in non-thinking and thinking modes. Higher scores indicate stronger safety. All other settings follow Figure 1.
  • Figure 4: Average safety scores on the AttaQ benchmark for Granite-4.0-Tiny-Preview and DeepSeek-R1-Distill-Qwen-7B, before and after SFT with STAR-1 data. Bars show performance without SFT (blue) and the improvement after SFT (pink).
  • Figure 5: Distribution of Min-K% Probability (Min-60% Prob; Shi et al., 2023) values for DeepSeek-R1-Distill-Qwen-7B and Granite-4.0-Tiny-Preview on STAR-1 data. Lower scores indicate stronger memorization.
  • ...and 6 more figures
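
The Min-K% Probability statistic named in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the common formulation: take the K% of tokens in a sequence with the lowest log-probability under the model and average those log-probabilities. The function name `min_k_percent_prob`, the `token_logprobs` input, and the example values are hypothetical, and the sign convention behind the scores plotted in Figure 5 is not stated here, so this should not be read as the authors' exact implementation.

```python
import math


def min_k_percent_prob(token_logprobs, k=0.6):
    """Mean log-probability of the k fraction of tokens with the lowest log-probability.

    token_logprobs: per-token log-probabilities of one sequence (hypothetical input,
                    assumed to come from a separate scoring pass over the model).
    k:              fraction of tokens to keep; 0.6 corresponds to "Min-60% Prob".
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")

    # Keep at least one token so that very short sequences still yield a score.
    n_select = max(1, math.ceil(k * len(token_logprobs)))

    # Sort ascending so the least likely tokens come first, then average them.
    lowest = sorted(token_logprobs)[:n_select]
    return sum(lowest) / n_select


# Illustrative values only; these are not taken from the paper.
example_logprobs = [-0.1, -2.0, -0.5, -3.0, -0.2]
score = min_k_percent_prob(example_logprobs, k=0.6)
print(f"Min-60% Prob: {score:.4f}")
```

With five tokens and `k=0.6`, the three least likely tokens are averaged, so a sequence with a few very unlikely tokens is scored mainly by those tokens and not by its easy ones.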