Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu
TL;DR
This work investigates why safety alignment fails in large reasoning models (LRMs), using linear probes to trace refusal intentions across token positions. It identifies a refusal cliff, where internal refusal signals collapse at the final tokens before output generation, and causally links it to a sparse set of attention heads that suppress refusal (Refusal Suppression Heads). Ablating a small fraction of these heads raises refusal scores and lowers attack success rates, while a data-selection method called Cliff-as-a-Judge uses the mismatch between internal refusal intent and final output to curate a small, high-impact training set that achieves comparable safety gains with roughly 1.7% of the vanilla safety training data. Together, these results point to a practical, data-efficient way to repair safety alignment in LRMs and highlight attention-head dynamics as a lever for robust safety behavior.
Abstract
Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon that we term the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
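
To make the idea of a refusal cliff and cliff-based data selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes a logistic linear probe with hypothetical weights `w` and bias `b` applied to per-token hidden states, and it defines the cliff as the mean refusal score over earlier tokens minus the mean over the last `n_final` tokens; the abstract does not specify the probe form or the cliff metric, so the function names and these definitions are illustrative assumptions.

```python
import numpy as np


def refusal_scores(hidden_states, w, b):
    """Per-token refusal probability from a logistic linear probe.

    hidden_states: array of shape (num_tokens, hidden_dim)
    w, b: probe weights and bias (assumed to be trained elsewhere)
    """
    logits = hidden_states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))


def refusal_cliff(scores, n_final=1):
    """Assumed cliff metric: mean score over the earlier tokens
    minus mean score over the last n_final tokens."""
    if len(scores) <= n_final:
        raise ValueError("need more tokens than n_final")
    plateau = scores[:-n_final].mean()
    final = scores[-n_final:].mean()
    return float(plateau - final)


def select_by_cliff(cliffs, fraction):
    """Indices of the examples with the largest cliff,
    keeping ceil(fraction * N) examples (at least one)."""
    k = max(1, int(np.ceil(fraction * len(cliffs))))
    order = np.argsort(cliffs)[::-1]  # largest cliff first
    return order[:k].tolist()
```

Under these assumptions, one would compute `refusal_cliff(refusal_scores(h, w, b))` for each candidate training example and pass the resulting list to `select_by_cliff` with a small `fraction` to pick the examples where internal refusal intent and final-token behavior disagree most.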
