SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu, Guanghao Li, Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si

TL;DR

SafeRBench addresses a gap in large reasoning model safety by providing an end-to-end assessment across inputs, reasoning traces, and outputs. It introduces risk-level stratification for inputs, micro-thought chunking of reasoning traces with cognitive-intent labeling, and ten safety dimensions that are aggregated into two composite scores: Risk Exposure Score (RES) and Safety Awareness Score (SAS). The framework is validated on 19 LRMs, showing that reasoning traces strongly predict final safety and revealing scale-dependent dynamics where thinking helps safety up to a point but can increase risk at very large scales. The approach yields actionable insights for designing safer LRMs in high-stakes settings and aligns AI judgments with human safety assessments.

Abstract

Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end -- from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.

Paper Structure

This paper contains 36 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of SafeRBench.
  • Figure 2: Illustrative “risk spectrum” with example queries and rationales.
  • Figure 3: SafeRBench evaluation of 19 Large Reasoning Models (LRMs) across 10 dimensions, divided into Risk Exposure and Safety Awareness, contributing to the Overall Safety Score. Results are normalized for comparison. See Tables \ref{tab:safety_awareness} and \ref{tab:risk_exposure} for detailed numerical results.
  • Figure 4: Pairwise correlations between key dimensions of model performance. A linear fit is applied to visualize the correlation, with Spearman's correlation coefficient ($\rho$) calculated for each pair.
  • Figure 5: Comparison of answer risk level, execution level, and non-rejection rate in Thinking vs. Non-Thinking modes for the Qwen3 series models.
  • ...and 3 more figures
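
The Figure 4 caption refers to Spearman's correlation coefficient ($\rho$) between pairs of dimensions. As a rough illustration of what that statistic measures, the sketch below computes $\rho$ as the Pearson correlation of average ranks, using only the Python standard library. The function names and the per-model score lists are hypothetical placeholders, not values or code from the paper.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation applied to the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up per-model scores for two dimensions (illustrative only).
    trace_risk = [0.12, 0.40, 0.33, 0.75, 0.58]
    answer_risk = [0.10, 0.35, 0.38, 0.80, 0.50]
    print(f"rho = {spearman_rho(trace_risk, answer_risk):.3f}")
```

Because $\rho$ depends only on rank order, it captures monotonic relationships between two dimensions even when the relationship is not linear, which is why it can be reported alongside a linear fit that serves purely as a visual aid.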