The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang

TL;DR

This work provides a comprehensive safety evaluation of large reasoning models (LRMs), contrasting open-source variants like DeepSeek-R1 with proprietary baselines such as o3-mini. It introduces a multi-faceted framework that covers safety benchmarks, adversarial attacks, and a comparison of safety in the model’s internal thinking versus its final outputs, augmented by harmfulness assessments via reward models. Key findings reveal a substantial safety gap between open LRMs and o3-mini, a link between stronger reasoning and higher potential harm, and notable safety risks embedded in intermediate thinking steps. The results underscore the need for stronger safety alignment in LRMs, particularly for reasoning processes, and propose approaches such as rule-based rewards and domain-specific safety data to mitigate risks. The study also discusses limitations, including the opacity of proprietary systems and the challenge of mitigating unsafe reasoning trajectories, to guide future safety research for deliberative AI systems.

Abstract

The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmarks and adversarial attacks, suggesting that open LRMs need more safety effort. (2) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (3) Safety thinking emerges in the reasoning process of LRMs, but fails frequently against adversarial attacks. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.

Paper Structure

This paper contains 40 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: We perform a multi-faceted safety analysis of large reasoning and non-reasoning models, focusing on three key aspects: (1) Comparison of performance across safety benchmarks and attacks. (2) Analysis of safety differences between the reasoning process and the final answer. (3) Evaluation of the harmfulness of model responses.
  • Figure 2: Level-2 categorized results of the models on Air-Bench.
  • Figure 3: Harmfulness evaluation results for two pairs of LLMs using two reward models on the Air-Bench dataset. The responses from reasoning models provide more help with the harmful questions.
  • Figure 4: Example of a large reasoning model providing a more detailed and structured response to a malicious query than a non-reasoning model.
  • Figure 5: Three scenarios of the R1 model under jailbreak: (A) Identifies safety concerns but executes the user's request unreflectively. (B) Recognizes safety issues but is misled. (C) Fails to recognize any safety concerns.
  • ...and 7 more figures