STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Zijun Wang; Haoqin Tu; Yuhan Wang; Juncheng Wu; Yanqing Liu; Jieru Mei; Brian R. Bartoldson; Bhavya Kailkhura; Cihang Xie

TL;DR

STAR-1 tackles the safety-performance trade-off in large reasoning models (LRMs) by constructing a 1K-scale, high-quality safety dataset grounded in deliberative reasoning and diverse policies. The authors curate roughly 41K source samples into 40,961 unique harmful instructions, classify them into eight safety categories with policy-grounded chain-of-thought (CoT) prompts, and distill them to 1K high-quality samples via GPT-4o scoring and diversity filtering. Fine-tuning LRMs on STAR-1 yields substantial safety gains (around 40% on average across four benchmarks) with only modest reasoning declines (about 1–3%, averaging 1.1% across five reasoning tasks). Ablations identify deliberative reasoning and high-quality filtering as the key drivers. The work shows that a carefully curated small dataset can outperform larger baselines and offers a practical, scalable path to safer LRMs, with implications for robust alignment in real-world systems.

Abstract

This paper introduces STAR-1, a high-quality safety dataset of just 1K samples, specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical need for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while incurring only a marginal decrease (an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
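The final selection step (score-based filtering followed by a diversity-aware pick) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the `min_score` threshold, and the round-robin balancing across categories are all assumptions made for illustration, and the scores are taken as pre-computed numbers rather than produced by a GPT-4o call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    instruction: str
    category: str   # hypothetical safety-category label
    score: float    # hypothetical pre-computed quality score (e.g. from an LLM judge)


def select_subset(samples, min_score, target_size):
    """Keep samples scoring >= min_score, then pick round-robin across categories."""
    # Step 1: quality filter, grouping the survivors by category.
    buckets = defaultdict(list)
    for s in samples:
        if s.score >= min_score:
            buckets[s.category].append(s)

    # Within each category, take the highest-scoring samples first.
    for cat in buckets:
        buckets[cat].sort(key=lambda s: s.score, reverse=True)

    # Step 2: round-robin over categories (sorted for determinism) so that
    # no single category dominates the selected subset.
    selected = []
    categories = sorted(buckets)
    depth = 0
    while len(selected) < target_size:
        progressed = False
        for cat in categories:
            if depth < len(buckets[cat]):
                selected.append(buckets[cat][depth])
                progressed = True
                if len(selected) == target_size:
                    break
        if not progressed:
            break  # every category is exhausted before reaching target_size
        depth += 1
    return selected
```

Round-robin is only one simple way to approximate "diversity"; a real pipeline might also balance across source datasets or other attributes. If fewer samples pass the threshold than `target_size`, the function returns all of them rather than lowering the quality bar.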
