Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Utsav Maskey, Mark Dras, Usman Naseem

Abstract

Aligned language models trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that superficially resemble harmful ones. A natural remedy is to ablate the global refusal direction, steering hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing confirms that the two refusal types are representationally distinct from the early transformer layers onward. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
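To make "ablating the global refusal direction" concrete, the following is a minimal sketch, assuming the direction is the unit-normalised difference in means (DIM) between refused and answered hidden states and that ablation means projecting that direction out. The function names, the array shapes, and the synthetic activations are illustrative assumptions, not the authors' code or data.

```python
import numpy as np


def dim_direction(refused: np.ndarray, answered: np.ndarray) -> np.ndarray:
    """Unit difference-in-means direction between two sets of hidden states.

    Both inputs are assumed to have shape (n_examples, d_model).
    """
    diff = refused.mean(axis=0) - answered.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along a unit direction."""
    return hidden - np.outer(hidden @ direction, direction)


# Synthetic stand-in for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 4.0  # hypothetical "refusal" offset along a single axis

answered = rng.normal(size=(200, d_model))
refused = rng.normal(size=(200, d_model)) + offset

r_hat = dim_direction(refused, answered)
ablated = ablate(refused, r_hat)

# After ablation, no hidden state has any component left along r_hat.
print(np.abs(ablated @ r_hat).max())
```

Because a single vector is removed from every hidden state regardless of task, this kind of intervention can only help over-refusal to the extent that the over-refusal signal happens to align with that one direction.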

Paper Structure

This paper contains 32 sections, 2 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: PCA of residual stream at early, mid and late layers. Black arrow: global DIM (difference-in-mean) refusal direction; colored arrows: NLP task-specific directions (Sentiment Analysis, Rephrase, etc.). Top row: harmful-refusal, where task-specific directions converge onto the global vector by mid-layers, confirming a single task-agnostic direction suffices. Bottom row: over-refusal, where task-specific directions diverge, with no shared refusal zone, making global intervention structurally imprecise.
  • Figure 2: Harmful-refusal DIM direction selection. Left: projection score distributions at L11, where refused-harmful and harmless-answered are cleanly separated. Right: score gap across layers, peaking at L11 and L17.
  • Figure 3: PCA at the peak task-identity layer (layer 12). Left: five task clusters. Right: over-refusal examples (triangles) overlap with the non-refusal clusters.
  • Figure 4: Per-task centroid distances vs. inter-task / global centroid distances across layers. The gap confirms that over-refusal is a within-cluster perturbation and is less likely to be task-agnostic.
  • Figure 5: Inter-task / global silhouette (left) peaks at layer 12; per-task behavioural silhouette (right) peaks later and is much weaker, confirming task structures form in the earlier layers.
  • ...and 9 more figures
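
The centroid-distance comparison described in the Figure 4 caption can be sketched at a single layer as follows. This is an illustration under stated assumptions: the captions do not specify the exact distance metric or aggregation, so Euclidean distance between mean vectors is assumed here, and the task names, dimensions, and synthetic activations are hypothetical.

```python
import numpy as np


def centroid(x: np.ndarray) -> np.ndarray:
    """Mean hidden state of a set of examples with shape (n_examples, d_model)."""
    return x.mean(axis=0)


def within_task_distance(answered: dict, over_refused: dict) -> float:
    """Mean distance between answered and over-refused centroids of the same task."""
    return float(np.mean([
        np.linalg.norm(centroid(answered[t]) - centroid(over_refused[t]))
        for t in answered
    ]))


def inter_task_distance(answered: dict) -> float:
    """Mean pairwise distance between answered centroids of different tasks."""
    cents = [centroid(x) for x in answered.values()]
    dists = [
        np.linalg.norm(a - b)
        for i, a in enumerate(cents)
        for b in cents[i + 1:]
    ]
    return float(np.mean(dists))


# Synthetic stand-in data: each task gets a well-separated cluster, and
# over-refused examples are shifted slightly in a task-specific direction.
rng = np.random.default_rng(0)
d_model = 16
tasks = ["sentiment", "rephrase", "summarise"]  # hypothetical task names

answered, over_refused = {}, {}
for i, task in enumerate(tasks):
    center = np.zeros(d_model)
    center[i] = 10.0
    shift = np.zeros(d_model)
    shift[i + 3] = 1.0  # small shift along a different axis for each task
    answered[task] = center + rng.normal(size=(100, d_model))
    over_refused[task] = center + shift + rng.normal(size=(100, d_model))

within = within_task_distance(answered, over_refused)
between = inter_task_distance(answered)
print(f"within-task: {within:.2f}, inter-task: {between:.2f}")
```

In this toy setup the within-task distance is much smaller than the inter-task distance, which is the pattern the caption describes as a within-cluster perturbation. Repeating the computation per layer would give curves of the kind the figure plots.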