Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu

TL;DR

This work investigates why safety alignment fails in large reasoning models (LRMs), using linear probes on internal activations to detect refusal intentions. It identifies a Refusal Cliff, where internal refusal signals stay strong during thinking but collapse at the final tokens before output, and causally links the collapse to a sparse set of Refusal Suppression Heads in the attention layers. Ablating a small fraction of these heads (3% in the abstract's figures) boosts refusal signals and can reduce attack success rates below 10%. A data-selection method called Cliff-as-a-Judge then uses the gap between internal intent and final output to curate a small, high-impact training set that achieves comparable safety gains with roughly 1.7% of the vanilla safety training data. Together, the results point to a data-efficient way to repair safety alignment in LRMs by combining mechanistic insights with targeted fine-tuning, and they highlight attention-head dynamics as a lever for robust safety behavior.

Abstract

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed the refusal cliff: many poorly-aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
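
The probing and selection ideas described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the logistic probe, the definition of the cliff as "mean score over thinking tokens minus mean score over the final tokens", and the names refusal_scores, cliff_size, and select_top_cliff are all assumptions made for illustration, and the hidden states here are toy lists rather than real model activations.

```python
import math
from typing import List, Sequence


def refusal_scores(hidden_states: Sequence[Sequence[float]],
                   weights: Sequence[float],
                   bias: float) -> List[float]:
    """Apply a (hypothetical) logistic linear probe at every token position."""
    scores = []
    for h in hidden_states:
        logit = sum(w * x for w, x in zip(weights, h)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def cliff_size(scores: Sequence[float], tail: int = 1) -> float:
    """Assumed cliff measure: mean score before the last `tail` tokens
    minus mean score over the last `tail` tokens."""
    if not 0 < tail < len(scores):
        raise ValueError("tail must be between 1 and len(scores) - 1")
    plateau = sum(scores[:-tail]) / (len(scores) - tail)
    final = sum(scores[-tail:]) / tail
    return plateau - final


def select_top_cliff(cliffs: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the examples with the largest cliff (at least one)."""
    k = max(1, round(len(cliffs) * fraction))
    order = sorted(range(len(cliffs)), key=lambda i: cliffs[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy 1-dimensional "hidden states": strong refusal signal, then a drop.
    toy_states = [[1.0], [1.0], [-1.0]]
    scores = refusal_scores(toy_states, weights=[2.0], bias=0.0)
    print("cliff:", cliff_size(scores, tail=1))
    print("selected:", select_top_cliff([0.1, 0.9, 0.5, 0.0], fraction=0.5))
```

In this sketch, an example whose probe score stays high through the thinking tokens and then drops at the end gets a large cliff value, and the selection step simply keeps the examples with the largest values. The fraction passed to select_top_cliff is a free parameter here; the 1.7% figure in the abstract refers to the paper's own experiments.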

Paper Structure

This paper contains 36 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: An overview of our paper. Left: We train a prober and discover the refusal cliff. Center: We find Refusal Suppression Heads to be the main cause of the cliff. Right: We design a data selection method based on probing the cliff.
  • Figure 2: While some reasoning models achieve reasonable safety performance, a significant portion exhibit alarming vulnerabilities to adversarial attacks. We benchmark reasoning models (RLVR-based and distillation-based) on AdvBench [chao2024jailbreakbench] and WildJailbreak [wildteaming2024], with Attack Success Rate (ASR, lower is better) as the evaluation metric.
  • Figure 3: The loss, validation accuracy and OOD validation accuracy of the refusal prober.
  • Figure 4: Left: Reasoning model with refusal cliff. We highlight the cliff position with orange background. Right: Well-aligned reasoning models experience no refusal cliff.
  • Figure 5: First column from the left: Layer-wise refusal scores of R1-Distill-Qwen-7B and R1-Distill-LLaMA-8B, from shallow layers to deeper layers. Second column from the left: Comparison of refusal scores on normal prompts and plateau values. The gray line is the average refusal score on normal prompts and the green line is the plateau of well-aligned models from the same family. Third and fourth columns from the left: Relation between thinking length and misalignment. We gradually clip the thinking and force the model to answer directly.
  • ...and 3 more figures