The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger

TL;DR

The paper studies how refusal in large language models can be bypassed and challenges the notion of a single linear refusal direction. It develops a gradient-based approach to representation engineering, comprising Refusal Direction Optimization (RDO) and Refusal Cone Optimization (RCO), and introduces Representational Independence (RepInd) to characterize independent mechanisms in activation space. The results reveal multi-dimensional refusal cones (up to dimension 4–5) and multiple RepInd directions, some of which are accessible via input manipulation, showing that the geometry of refusal is more complex than previously assumed. The framework offers a practical path toward analyzing and strengthening safety alignment and can be extended to concepts beyond refusal.

Abstract

The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.

Paper Structure

This paper contains 18 sections, 8 equations, 16 figures, 5 tables, 2 algorithms.

Figures (16)

  • Figure 1: An example of a 3D concept cone with its basis vectors. All directions in the cone mediate refusal.
  • Figure 2: Attack success rates of refusal directions for different models. We compare the DIM direction baseline that is extracted from prompts against our Refusal Direction Optimization for jailbreaking with directional ablation and activation subtraction.
  • Figure 3: Attack success rates for multi-dimensional cones for Gemma 2, Qwen 2.5 and Llama 3. Cone performance is measured via the performance of Monte Carlo samples, which are depicted as boxplots.
  • Figure 4: Refusal evaluation for different cone dimensions for the Qwen2.5 model family. The cone performance for models with fewer parameters degrades faster with increasing cone dimension compared to larger models.
  • Figure 5: ASR for best-of-N sampling using $N$ samples from the 4-dimensional refusal cone of Gemma-2-2B, compared to best-of-N sampling with temperature $T$ using the single-dimension RDO.
  • ...and 11 more figures
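
The captions above refer to directional ablation, activation subtraction, and Monte Carlo samples from a refusal cone. The following is a minimal NumPy sketch of these operations, not the authors' implementation. It assumes the usual formulations from prior work on refusal directions: ablation projects out the component along a unit direction, and subtraction shifts activations by the direction. It also assumes that a cone consists of the non-negative combinations of its basis vectors, and that samples are drawn with half-normal coefficients; the paper's exact definition and sampling scheme are not given in this summary. All function names and the toy activations are hypothetical.

```python
import numpy as np

def directional_ablation(x, r):
    """Remove the component of each activation in x (shape [n, d]) along r."""
    r_hat = r / np.linalg.norm(r)
    return x - (x @ r_hat)[..., None] * r_hat

def activation_subtraction(x, r, alpha=1.0):
    """Shift each activation by -alpha * r (alpha is an assumed scale knob)."""
    return x - alpha * r

def sample_cone_direction(basis, rng):
    """Draw one unit direction from the cone spanned by the rows of basis.

    Assumption: the cone is the set of non-negative combinations of the
    basis vectors, and coefficients are sampled as |N(0, 1)|.
    """
    coeffs = np.abs(rng.standard_normal(basis.shape[0]))
    v = coeffs @ basis
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
x = rng.standard_normal((4, d))         # stand-in for residual-stream activations

# Toy 3-dimensional cone with an orthonormal basis (rows of `basis`).
q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
basis = q.T

r = sample_cone_direction(basis, rng)
x_ablated = directional_ablation(x, r)
x_shifted = activation_subtraction(x, r)
```

After ablation, every row of `x_ablated` has zero component along `r`, whereas `x_shifted` only moves the activations by a fixed offset. Evaluating many sampled directions `r` in this way is one plausible reading of how the performance of a whole cone, rather than a single direction, could be estimated.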

Theorems & Definitions (2)

  • Definition 4.1
  • Definition 6.1