Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen

TL;DR

A causal double dissociation is demonstrated, effectively creating a state of ``Knowing without Acting.''

Abstract

Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the \textbf{\underline{D}}isentangled \textbf{\underline{S}}afety \textbf{\underline{H}}ypothesis \textbf{(DSH)}, positing that safety computation operates on two distinct subspaces: a \textit{Recognition Axis} ($\mathbf{v}_H$, ``Knowing'') and an \textit{Execution Axis} ($\mathbf{v}_R$, ``Acting''). Our geometric analysis reveals a universal ``Reflex-to-Dissociation'' evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce \textit{Double-Difference Extraction} and \textit{Adaptive Causal Steering}. Using our curated \textsc{AmbiguityBench}, we demonstrate a causal double dissociation, effectively creating a state of ``Knowing without Acting.'' Crucially, we leverage this disentanglement to propose the \textbf{Refusal Erasure Attack (REA)}, which achieves State-of-the-Art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the \textit{Explicit Semantic Control} of Llama3.1 with the \textit{Latent Distributed Control} of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.

Paper Structure (23 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Layer-wise cosine similarity between the Recognition axis $\mathbf{v}_H$ and the Execution axis $\mathbf{v}_R$. The trajectory moves from strong antagonism in early layers (Sim $\approx -0.9$) toward the random baseline (dashed line). Deep-layer values lie within the plotted 95% confidence band, indicating that the alignment between the two axes becomes statistically indistinguishable from that of random vector pairs. This opens a latent gap where "Knowing" need not trigger "Acting".
  • Figure 2: Overview of the Disentangled Safety Framework. Top: prior baselines assume a monolithic "harm" direction that conflates semantic recognition and refusal. Bottom: under DSH we split safety into a Recognition axis $\mathbf{v}_H$ (semantic understanding) and an Execution axis $\mathbf{v}_R$ (refusal). Surgically removing $\mathbf{v}_R$ via our Refusal Erasure Attack (REA) preserves the model's recognition of harmful content while disabling refusal, empirically validating the decomposition.
  • Figure 3: Semantic projections of $\mathbf{v}_H$ and $\mathbf{v}_R$ at the last layer on JailbreakBench. Note the explicit semantic lock in Llama/Mistral vs. latent artifacts in Qwen.
  • Figure 4: Layer-wise Cosine Similarity between $\mathbf{v}_H$ and $\mathbf{v}_R$. The dashed line and grey band represent the mean and 95% confidence interval of 1000 random vector pairs, respectively. In deep layers, the safety axes' similarity converges to this random baseline, confirming the "Reflex-to-Dissociation" pattern.
  • Figure 5: Layer-wise evolution of the safety axes, comparing Llama3.1 and Qwen2.5 on JailbreakBench. Llama (top rows in each panel) shows a sharp phase transition to explicit semantic tokens (red), while Qwen (bottom rows) remains largely latent/structural (grey) with sporadic anchors.
  • ...and 2 more figures
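
The random baseline described in the captions of Figures 1 and 4 can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: random vectors are drawn from a standard Gaussian, the "95% confidence interval" is read as the empirical 2.5th to 97.5th percentile range of the sampled similarities, and the hidden size of 4096 is a hypothetical value chosen for illustration. The function names `random_baseline` and `within_band` are likewise hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def random_baseline(dim, n_pairs=1000, seed=0):
    """Mean and 95% band of cosine similarity over random vector pairs.

    Assumption: vectors are i.i.d. standard Gaussian, and the band is the
    empirical 2.5th-97.5th percentile range (the source does not specify).
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    # Row-wise cosine similarity for all pairs at once.
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    lo, hi = np.percentile(sims, [2.5, 97.5])
    return float(sims.mean()), float(lo), float(hi)


def within_band(sim, band):
    """True if a measured similarity falls inside the random-pair band."""
    _, lo, hi = band
    return lo <= sim <= hi


if __name__ == "__main__":
    band = random_baseline(dim=4096)  # 4096 is a hypothetical hidden size
    print("mean, lo, hi:", band)
    # An early-layer value such as Sim ~ -0.9 lies far outside the band,
    # while a value near zero lies inside it.
    print(within_band(-0.9, band), within_band(0.0, band))
```

In high dimensions the similarity of random pairs concentrates tightly around zero, which is why a deep-layer similarity that falls inside the band cannot be distinguished from that of two unrelated directions.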