Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Yuyi Huang, Runzhe Zhan, Lidia S. Chao, Ailin Tao, Derek F. Wong

TL;DR

Path Drift is a trajectory-level vulnerability in large reasoning models (LRMs) that use Long Chain-of-Thought (Long-CoT) reasoning: multi-step reasoning gradually departs from safety constraints even when early steps appear compliant. The authors formalize Path Drift, identify three triggers (first-person commitments, ethical evaporation, and condition chains), and validate a three-stage Path Drift Induction Framework: Cognitive Load Amplification, Self-Goal Priming, and Chain Injection. They propose defenses including Role Attribution Correction and Metacognitive Reflection, and outline training- and inference-time interventions to restore safety during reasoning. The work reports substantially lower refusal rates under first-person (FP) prompts and highlights the need for trajectory-level alignment oversight to mitigate inner-chain hijacking and safety fatigue in reasoning-centric models.

Abstract

As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments, which induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; and (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, and their combination compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning, beyond token-level alignment.

Paper Structure

This paper contains 30 sections, 2 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Conceptual Illustration of Path Drift in Long-CoT Reasoning. Three reasoning trajectories induced by different inputs, illustrating how semantic path drift can accumulate into alignment failure. In the left panel ($x_1$), the model follows a stable trajectory where all reasoning steps are safe ($\delta=0$), leading to a compliant output ($y_1 \in \mathcal{P}$). In the middle panel ($x_2$), the trajectory begins safely but deviates at $r_3^b$ ($\delta=1$), with small semantic drifts compounding into an unsafe output $y_2$. In the right panel ($x_3$), early branching driven by subtle cues rapidly accumulates unsafe steps and risk-amplifying branches, producing an unsafe output $y_3$. Each reasoning step $r_i$ is annotated with $\delta(r_i)$, where $\delta=0$ indicates alignment with the safety policy and $\delta=1$ denotes path drift or misalignment. The color gradient reflects increasing severity of unsafe reasoning.
  • Figure 2: Reasoning pathway induced by first-person mode in LRMs.
  • Figure 3: Refusal rates across models under first-person (active) vs third-person (non-active) prompting.
  • Figure 4: Word frequency heatmap of refusal-related and ethical terms across prompting strategies. The figure shows the distribution of key refusal and ethical marker tokens across three prompting conditions: TP (third-person), FP (first-person), and FP_NE (first-person with no-ethics override).
  • Figure 5: Token-level logit trajectories under incremental condition chains. The figure visualizes the token logit values for the target word "harm" across 50 decoding steps under three prompting conditions: Baseline (blue), Semantic Hint (orange), and Full Condition Chain (green). Shaded areas denote $\pm 1$ standard deviation across models.
  • ...and 6 more figures
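
The step-level annotation described in the Figure 1 caption can be illustrated with a small sketch. It assumes only what the caption states: each reasoning step $r_i$ carries a label $\delta(r_i) \in \{0, 1\}$, with $\delta=1$ marking drift. The function names and the example label sequences below are hypothetical and chosen for illustration; they are not taken from the paper's code or data.

```python
from typing import Optional, Sequence


def first_drift_step(deltas: Sequence[int]) -> Optional[int]:
    """Return the 1-based index of the first step labelled delta == 1.

    Returns None when every step is labelled safe (delta == 0).
    """
    for i, d in enumerate(deltas, start=1):
        if d == 1:
            return i
    return None


def drift_fraction(deltas: Sequence[int]) -> float:
    """Fraction of steps labelled as drifted; 0.0 for an empty trajectory."""
    return sum(deltas) / len(deltas) if deltas else 0.0


# Hypothetical delta labels loosely mirroring the three panels of Figure 1.
stable = [0, 0, 0, 0]       # all steps safe
late_drift = [0, 0, 1, 1]   # starts safe, deviates at the third step
early_drift = [0, 1, 1, 1]  # early branching, unsafe steps accumulate

for name, traj in [("stable", stable), ("late", late_drift), ("early", early_drift)]:
    print(name, first_drift_step(traj), drift_fraction(traj))
```

A check of this kind looks at the whole trajectory rather than only the final output, which is the sense in which the abstract calls for trajectory-level oversight. How the labels $\delta(r_i)$ are obtained in practice is not specified here and is left open.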