Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang

Abstract

Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT) reasoning, but recent studies show that these enhanced reasoning capabilities come at the expense of significantly degraded safety. In this paper, we reveal that this safety degradation occurs only when CoT is enabled and is not observed when CoT is disabled. This observation motivates us to encourage LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before CoT generation begins. Specifically, we first use a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into the LRMs' safety alignment as auxiliary supervision. In this way, safety gradients can be backpropagated to the LRMs' latent representations, strengthening their safety decision-making before CoT generation. Extensive experiments demonstrate that our method substantially improves the safety of LRMs while maintaining their general reasoning performance.
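
The auxiliary-supervision idea in the abstract can be illustrated with a minimal sketch. It is not the paper's actual objective, which this summary does not state. It assumes a linear probe on the latent state taken just before CoT generation, a binary refuse/comply label produced by the teacher's classifier, and a weighted sum of the language-modeling loss and the probe loss. All names here (`safety_probe_logit`, `combined_loss`, `aux_weight`) are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one token position (stand-in for the usual LM loss)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def safety_probe_logit(hidden, weights, bias):
    """Hypothetical linear probe on the latent state before CoT generation."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def binary_cross_entropy_with_logit(logit, label):
    """Numerically stable BCE; label is 1.0 (refuse) or 0.0 (comply)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def combined_loss(lm_logits, lm_target, hidden, probe_w, probe_b,
                  teacher_label, aux_weight=0.5):
    """LM loss plus a weighted auxiliary safety-decision loss (assumed form)."""
    lm = softmax_cross_entropy(lm_logits, lm_target)
    aux = binary_cross_entropy_with_logit(
        safety_probe_logit(hidden, probe_w, probe_b), teacher_label
    )
    return lm + aux_weight * aux
```

Because the auxiliary term depends on the hidden state, minimizing it with a differentiable framework would push gradients into the latent representation, which is the effect the abstract describes. The weighting `aux_weight` is an illustrative choice only.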

Paper Structure (33 sections, 7 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Comparison of the CoT-enabled (CoT-ON) and CoT-disabled (CoT-OFF) states for the DeepSeek-R1 series (DS-R1-7B/8B/14B). Safety is tested on WildJailbreak [jiang2024wildteaming] and reasoning is tested on AIME24 [aime24].
  • Figure 2: Comparisons among (a) Vanilla LRMs, (b) LRMs with safety alignment, and (c) LRMs with PreSafe (ours).
  • Figure 3: The overall framework of our method. (1) Extracting safety decision signals. We first distill the safety decision-making capability of a safer teacher model into a BERT-based classifier, which encapsulates the teacher's binary safety policy. (2) Alignment for safety decision signals. This step promotes the LRM's safety decision-making capability by training the model's latent representations. (3) Inference phase. The trained model has a stronger safety decision-making capability: it tends to refuse harmful queries before producing CoT (Path A), while retaining its full reasoning capability for benign requests (Path B).
  • Figure 4: Evaluation of the reasoning capabilities of SafeChain, R2D, and PreSafe on AIME2024, MATH-500, and GPQA-Diamond.
  • Figure 5: Layer-wise update distribution induced by PreSafe on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B. For each model, the largest changes occur in the gate_proj and up_proj components and are concentrated in the deeper layers.
  • ...and 5 more figures
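
A layer-wise update distribution like the one in Figure 5 could be computed roughly as follows. This is a sketch under assumptions: the summary does not say which metric the authors use, so the relative Frobenius norm of the weight change is an illustrative choice, and the parameter names in the usage note are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def relative_update(before, after):
    """||W_after - W_before||_F / ||W_before||_F for one weight matrix."""
    delta = [[a - b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(after, before)]
    base = frobenius_norm(before)
    return frobenius_norm(delta) / base if base > 0 else 0.0


def update_distribution(params_before, params_after):
    """Return (name, relative update) pairs sorted from largest to smallest."""
    scores = {
        name: relative_update(params_before[name], params_after[name])
        for name in params_before
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, passing two dictionaries keyed by names such as `layers.27.mlp.gate_proj` and `layers.27.mlp.down_proj` (weights before and after alignment) ranks the modules by how much they changed.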