The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang
TL;DR
This work provides a comprehensive safety evaluation of large reasoning models (LRMs), contrasting open-source variants like DeepSeek-R1 with proprietary baselines such as o3-mini. It introduces a multi-faceted framework that includes safety benchmarks, adversarial attacks, and an analysis of safety in the models' internal thinking versus their final outputs, augmented by harmfulness assessments via reward models. Key findings reveal a substantial safety gap for open LRMs, a link between stronger reasoning and higher potential harm, and notable safety risks embedded in intermediate thinking steps. The results underscore the need for stronger safety alignment in LRMs, particularly for reasoning processes, and propose approaches such as rule-based rewards and domain-specific safety data to mitigate risks. The study also discusses limitations, including the opacity of proprietary systems and the challenge of mitigating unsafe reasoning trajectories, guiding future safety research for deliberative AI systems.
Abstract
The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmarks and adversarial attacks, suggesting that open LRMs need more safety effort. (2) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (3) Safety thinking emerges in the reasoning process of LRMs, but it frequently fails against adversarial attacks. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.
