STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Yanqing Liu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie

TL;DR

STAR-1 tackles the safety-performance trade-off in large reasoning models (LRMs) by constructing a 1K-scale, high-quality safety dataset grounded in deliberative reasoning and diverse policies. The authors merge roughly 41K samples from existing open-source safety datasets into 40,961 unique harmful instructions, classify them into eight safety categories, generate policy-grounded chain-of-thought (CoT) responses, and distill the pool to 1K high-quality samples via GPT-4o scoring and diversity filtering. Fine-tuning LRMs on STAR-1 yields substantial safety gains (around 40% on average) with only modest reasoning declines (about 1–3%), and ablations identify deliberative reasoning and high-quality filtering as the key drivers. The work demonstrates that carefully curated small datasets can outperform larger unsafe baselines and offers a practical, scalable path to safer LRMs, with implications for robust alignment in real-world systems.

Abstract

This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
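The filtering step described above (score each sample on several criteria, discard low scorers, keep category diversity) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names `category` and `scores`, the criterion keys, the `min_score` threshold, and the round-robin balancing strategy are all hypothetical choices made for the example.

```python
from collections import defaultdict


def select_samples(samples, min_score=10, budget=1000):
    """Keep samples whose every criterion score reaches min_score, then pick
    up to `budget` of them round-robin across safety categories.

    Assumed (hypothetical) sample layout:
        {"id": ..., "category": str, "scores": {criterion_name: int, ...}}
    """
    # Step 1: score filter. A sample survives only if all criteria pass.
    by_category = defaultdict(list)
    for sample in samples:
        if all(score >= min_score for score in sample["scores"].values()):
            by_category[sample["category"]].append(sample)

    # Step 2: diversity. Take one sample per category per round so that
    # no single category dominates the selected subset.
    queues = [list(items) for _, items in sorted(by_category.items())]
    selected = []
    while queues and len(selected) < budget:
        remaining = []
        for queue in queues:
            if len(selected) >= budget:
                break
            selected.append(queue.pop(0))
            if queue:
                remaining.append(queue)
        queues = remaining
    return selected
```

In this sketch the scores would come from an external judge model; the function only consumes them. Round-robin selection is one simple way to preserve category balance, and other strategies (for example, proportional sampling per category or per data source) would fit the same interface.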

Paper Structure

This paper contains 61 sections, 2 equations, 8 figures, 26 tables.

Figures (8)

  • Figure 1: Left: LRMs are vulnerable to malicious instructions. Middle: Generation pipeline of STAR-1. Each malicious instruction is tagged with a relevant safety category. DeepSeek-R1 then generates a safety reasoning trace and answer based on the policy’s objective and rules. GPT-4o evaluates the outputs across three criteria, and low-scoring samples are discarded. Right: STAR-1 improves an LRM's safety abilities by guiding it to recall policies.
  • Figure 2: Safety category distribution of our metadata (left) and STAR-1 (right). We make sure that the filtering process does not decrease the diversity of safety categories.
  • Figure 3: The average performance gap between (1) the model trained on STAR-1 and the Instruct model (blue) and (2) the model trained on STAR-1 and the R1-distilled model (red), on both safety and reasoning tasks across five model types.
  • Figure 4: Results of two models trained with STAR-1 and varied amounts of not_overrefusal (benign) examples on the overrefusal (XSTest; Röttger et al., 2023), safety, and reasoning tasks.
  • Figure 5: Example of our STAR-1 data
  • ...and 3 more figures