The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Isaac Llorente-Saguer

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and \emph{abliterated} (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq 0.937$ for harmful-vs-normative detection and AUROC $= 1.000$ for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.

Figures (10)

  • Figure 1: Precision-recall curves at the operating layer ($K{=}1$) for the two base variants; remaining models are in \ref{fig:app_pr_all}. Dotted horizontal lines indicate chance precision per task. The harmful-vs-normative curve (red) maintains precision $>0.92$ up to 90% recall for Qwen3.5-0.8B-Base (Prec@90 $= 0.928$) and $>0.90$ for Qwen2.5-0.5B-Base (Prec@90 $= 0.902$). The harmful-vs-benign-agg curve is flat at 1.000 precision across all recall levels. The normative-vs-benign-agg curve (green) lies below chance in Qwen3.5-0.8B (AUPRC $= 0.232$), confirming that benign-agg prompts are scored as less anomalous than normative, consistent with $r_\mathrm{b/n} = -0.384$.
  • Figure 2: Per-layer AUROC for the Qwen3.5-0.8B family. Each panel shows AUROC h/n (left) and AUROC h/b (right) vs. layer index. $K{=}1$ (solid blue) strictly dominates $K{>}1$ (orange/green/red) at every layer. The cosine-centroid baseline (purple dashed) consistently underperforms $K{=}1$. AUROC h/b $= 1.000$ is maintained at every layer regardless of alignment stage. The three panels are nearly indistinguishable in their h/b profile.
  • Figure 3: Per-layer AUROC for the Qwen2.5-0.5B family. $K{=}1$ (solid blue) remains the dominant scorer throughout. In the Instruct and Abliterated variants, the cosine-centroid and L2-norm baselines reach near-parity with $K{=}1$, reflecting the specific geometry of this family, but do not surpass it. The broad performance plateau across layers 5--23 bounds layer-selection optimism to $<0.08$ AUROC units.
  • Figure 4: Theta-phi projections at the operating layer. Harmful (red) and safe (blue/green) prompts form distinct concentric radial zones across all variants. In the Qwen3.5-0.8B family (left column), harmful intent occupies the outer ring; in Qwen2.5-0.5B (right column), it occupies the inner ring. The visual invariance across rows demonstrates that safety geometry is established during pretraining and remains intact even after the mathematical erasure of refusal mechanisms. All panels achieve AUROC h/b = 1.000.
  • Figure 5: Anomaly score distributions at the operating layer for all six variants. Each violin shows the marginal distribution of $s(x) = -\log p(\theta \mid \mu_0, \sigma_0^2)$ for normative eval (blue), harmful (red), benign-aggressive (green), and normative $\cup$ benign (purple). White circles denote medians; bars denote IQRs. In every panel, harmful prompts occupy a narrow, elevated band ($\sigma_\theta^\mathrm{harm}$ is 5--9$\times$ smaller than $\sigma_\theta^\mathrm{norm}$; \ref{tab:theta_stats}). In the Qwen3.5-0.8B family (left column), benign-aggressive scores fall below the normative distribution; in the Qwen2.5-0.5B family (right column), they overlap with it. The three panels within each column are nearly identical, illustrating that abliteration leaves the score landscape intact.
  • ...and 5 more figures
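The AUROC values quoted throughout the captions can be reproduced from raw anomaly scores without external dependencies. The helper below is an assumption of this summary rather than code from the paper; it computes AUROC as the normalised Mann-Whitney U statistic, with the positive class taken to be the one expected to score higher (here, harmful prompts, since the NLL score is elevated for them).

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """AUROC as the normalised Mann-Whitney U statistic: the probability
    that a randomly drawn positive outscores a randomly drawn negative,
    with ties counted as one half."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)
```

A value of 1.000, as reported for harmful vs. benign-aggressive, means every harmful prompt outscores every benign-aggressive prompt at that layer.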