Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi

TL;DR

Reinforcement-learning post-training for explicit chain-of-thought can boost reasoning in multimodal large-scale reasoning models (MLRMs), but it can also degrade safety alignment. SafeThink is an inference-time steering method that monitors the reasoning trace with a safety reward model and injects a short prefix, such as “Wait, think safely,” whenever the safety score falls below a threshold; it treats safety recovery as a satisficing constraint rather than an objective to maximize. Across six open-source MLRMs and four jailbreak benchmarks, SafeThink reduces jailbreak attack success rates by about 30–60% while preserving reasoning performance, and corrective steering is typically needed only in the first 1–3 reasoning steps. The defense requires no retraining, and the results suggest that safety-relevant behavior remains latent in RL-tuned models, which supports safer deployment of reasoning-capable AI systems.

Abstract

Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
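
The monitor-and-steer loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_step`, `safety_reward`, and the toy stand-ins below are hypothetical names, and the threshold default `tau = 0.0` simply mirrors the value quoted in the figure captions.

```python
from typing import Callable, List, Tuple

STEER_PREFIX = "Wait, think safely"  # corrective prefix quoted in the abstract


def safethink_generate(
    prompt: str,
    generate_step: Callable[[str, List[str], str], str],  # hypothetical model call
    safety_reward: Callable[[str, List[str]], float],     # hypothetical reward model
    tau: float = 0.0,
    max_steps: int = 8,
) -> Tuple[List[str], List[int]]:
    """Generate a reasoning trace step by step, steering only on violations.

    Returns the trace and the indices of the steps that were steered.
    """
    trace: List[str] = []
    steered: List[int] = []
    for t in range(max_steps):
        step = generate_step(prompt, trace, "")
        # Satisficing check: intervene only if the candidate trace scores below tau.
        if safety_reward(prompt, trace + [step]) < tau:
            step = generate_step(prompt, trace, STEER_PREFIX)
            steered.append(t)
        trace.append(step)
    return trace, steered


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_generate_step(prompt: str, trace: List[str], prefix: str) -> str:
    if prefix:
        return f"{prefix}. I should decline and explain why."
    already_steered = any(s.startswith(STEER_PREFIX) for s in trace)
    if "attack" in prompt and not already_steered:
        return "unsafe plan"
    return "benign step"


def toy_safety_reward(prompt: str, trace: List[str]) -> float:
    # Negative score when the latest step looks unsafe, positive otherwise.
    return -1.0 if "unsafe" in trace[-1] else 1.0


if __name__ == "__main__":
    trace, steered = safethink_generate(
        "attack query", toy_generate_step, toy_safety_reward, max_steps=3
    )
    print(steered)
    print(trace)
```

In this toy setup a single early intervention keeps the rest of the trace above the threshold, which is the qualitative behavior the abstract reports; the toy functions are rigged to show the control flow and say nothing about real model behavior.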

Paper Structure (27 sections, 12 equations, 16 figures, 4 tables)


Figures (16)

  • Figure 1: Overview of SafeThink. (a) Without intervention, reasoning MLRMs process adversarial queries through unsafe reasoning chains, producing harmful responses. SafeThink monitors the reasoning trace and injects a safety steering token when the safety threshold is violated. Safety recovery typically occurs within the first few reasoning steps, after which generation proceeds toward safe completions while preserving reasoning utility. (b) Reasoning fine-tuning improves task performance but degrades safety alignment, resulting in a higher attack success rate (ASR). For example, on the Hades benchmark [Li-HADES-2024], the safety score (defined as $100 - \text{ASR}$) drops sharply from $80.87\%$ for the base model Qwen2.5-VL to $30.93\%$ for R1-Onevision [yang2025r1], illustrating the reasoning tax on safety. SafeThink recovers safety at inference time without sacrificing reasoning capabilities (Reasoning MLRM + SafeThink).
  • Figure 2: Best-of-$N$ sampling fails to recover safe trajectories. We empirically validate that under adversarial inputs $x_{\text{adv}}$, the probability of sampling a safe continuation from the base policy is near-zero. Given an intermediate state $x' = (x_{\text{adv}}, z_{<t})$, BoN$^*$ samples $k$ candidate next steps directly from the base policy, $z_t^{(i)} \sim \pi(\cdot \mid x')$, and selects the one with maximum $R_{\text{safe}}$. Despite increasing $k$ up to 20, BoN$^*$ (purple) remains below the safety threshold $\tau$, confirming that safe continuations have vanishing probability mass under the base policy. SafeThink conditions generation on a safety steering token $s$, sampling $z_t^{(i)} \sim \pi(\cdot \mid x', s)$. This shifts the distribution toward safe regions, allowing SafeThink to cross the threshold with as few as $k{=}2$ samples. The results demonstrate that the failure of naive sampling stems not from the absence of safe continuations, but from poor conditional coverage, a gap that safety-steered sampling effectively bridges. Results are shown on Hades [Li-HADES-2024] for (a) R1-Onevision, (b) VLAA-Thinker, and (c) Vision-R1.
  • Figure 3: Satisficing safety alignment. Safety rates saturate above the threshold ($\tau = 0$), with $\sim$90% of responses deemed safe by GPT-4. This validates our threshold-based constraint: ensuring $R_{\text{safe}} \geq \tau$ is sufficient for safety alignment.
  • Figure 4: Early-step steering suffices for safety recovery. Steering depth indicates the number of initial reasoning steps where safety-targeted intervention is applied. We evaluate three models on three jailbreak benchmarks: (a) LlamaV-o1 [thawakar2025llamav] on JailbreakV-28K [luo2024jailbreakv], (b) R1-Onevision-7B [yang2025r1] on Hades [Li-HADES-2024], and (c) OpenVLThinker-7B [deng2025openvlthinker] on FigStep [gong2023figstep]. All models exhibit a sharp decline in Attack Success Rate (ASR) within the first few steering steps. The transition from jailbroken (red) to safe (green) regions demonstrates that targeted steering applied to only a small number of early reasoning steps is sufficient to redirect model trajectories toward safe completions, without requiring intervention throughout the entire generation process.
  • Figure 5: Evaluation of candidate steering tokens $\mathcal{S}$. We evaluate each steering token $s \in \mathcal{S}$ on two criteria: (a) safety reward $R_{\text{safe}}$, measuring effectiveness at redirecting reasoning toward safe continuations, and (b) KL divergence from the base policy, measuring distributional shift. Tokens lacking explicit safety language ("Wait, think again", "Lets rethink step by step again") remain below the safety threshold ($\tau=0$) across reasoning steps. Tokens with safety cues ("Lets rethink step by step safely", "Wait, think safely") consistently exceed the threshold. Among these, "Wait, think safely" achieves the highest safety reward while maintaining a low KL divergence, making it the optimal choice for inference-time steering.
  • ...and 11 more figures
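
The comparison in the Figure 2 caption, between best-of-$k$ sampling from the base policy $\pi(\cdot \mid x')$ and from the steered policy $\pi(\cdot \mid x', s)$, can be sketched as below. This is a hedged illustration only: `best_of_k`, `toy_sample_step`, and the score ranges are hypothetical, chosen so that the unsteered distribution has no mass above $\tau = 0$; they are not the paper's models or results.

```python
import random
from typing import Callable, Optional, Tuple

Candidate = Tuple[str, float]  # (step text, latent safety score) in this toy setup


def best_of_k(
    sample_step: Callable[[Optional[str]], Candidate],
    reward: Callable[[Candidate], float],
    k: int,
    steer: Optional[str] = None,
) -> Tuple[Candidate, float]:
    """Draw k candidate next steps and keep the one with the highest safety reward."""
    candidates = [sample_step(steer) for _ in range(k)]
    best = max(candidates, key=reward)
    return best, reward(best)


# Toy policy (assumption): steering shifts the score range from [-1.0, -0.2]
# to [0.2, 1.0]. Real policies are not this cleanly separated.
_rng = random.Random(0)


def toy_sample_step(steer: Optional[str]) -> Candidate:
    center = 0.6 if steer else -0.6
    score = _rng.uniform(center - 0.4, center + 0.4)
    text = f"{steer}: reconsider the request" if steer else "continue the plan"
    return text, score


def toy_reward(candidate: Candidate) -> float:
    return candidate[1]


if __name__ == "__main__":
    tau = 0.0
    _, unsteered = best_of_k(toy_sample_step, toy_reward, k=20)
    _, steered = best_of_k(toy_sample_step, toy_reward, k=2, steer="Wait, think safely")
    print(unsteered >= tau, steered >= tau)
```

The point of the sketch is structural: selecting the maximum over more samples cannot exceed the threshold if the sampling distribution places no mass above it, whereas conditioning on the steering token changes the distribution being sampled.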