Safety in Large Reasoning Models: A Survey
Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhong-Zhi Li, Yingwei Ma, Yufei He, Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
TL;DR
Large Reasoning Models (LRMs) introduce explicit reasoning traces that create new safety vulnerabilities not present in prior LLMs. The paper provides a comprehensive taxonomy of inherent risks, adversarial attacks, and defense strategies (safety alignment, inference-time defenses, and guard models) tailored to reasoning-heavy systems. Key contributions include a typology of attacks (reasoning-length, backdoor, error injection, prompt injection, jailbreak), a multi-faceted defense framework, and recommendations for standardized benchmarks and human-in-the-loop alignment. The findings highlight the need for reasoning-aware safeguards and evaluation to ensure safe deployment of LRMs in high-stakes domains.
Abstract
Large Reasoning Models (LRMs) have shown strong performance on tasks such as mathematics and coding by leveraging their advanced reasoning capabilities. As these capabilities progress, however, significant concerns about their vulnerabilities and safety have arisen, and these concerns can hinder deployment and application in real-world settings. This paper presents a comprehensive survey of LRM safety, exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these models.
