SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio

TL;DR

This work reevaluates how refusal behavior in large language models is represented, challenging the conventional single-direction view. It introduces a multi-directional (MD) framework that uses Self-Organizing Maps (SOMs) to uncover a refusal manifold and derive multiple directions for ablation, generalizing the prior centroid-based approach. Across diverse models and jailbreak baselines, MD consistently improves attack success rate (ASR) by leveraging multiple, related directions and optimizing their selection via Bayesian Optimization. Mechanistic analysis shows that MD compresses harmful representations and aligns them closer to harmless ones, supporting a manifold-based understanding of refusal and offering a richer framework for safety evaluation and mechanistic interpretability.

Abstract

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
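The pipeline described in the abstract (train a SOM on harmful representations, then subtract the harmless centroid from each neuron) can be sketched on toy data as follows. This is a minimal illustration and not the authors' implementation: the function names, the online update with a Gaussian grid neighbourhood, the fixed learning rate, and the 2x2 grid are all assumptions made for the example, and real inputs would be model activations rather than 2-D points.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def train_som(data, rows, cols, steps=2000, alpha=0.1, sigma=1.0, seed=0):
    """Toy online SOM (assumed form): maps grid position -> weight vector."""
    rng = random.Random(seed)
    # Initialise every neuron on a randomly chosen data point.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for _ in range(steps):
        x = rng.choice(data)
        # Best-matching unit: the neuron closest to the sample.
        bmu = min(weights, key=lambda k: math.dist(weights[k], x))
        for pos, w in weights.items():
            # Gaussian neighbourhood on the grid, centred on the BMU.
            grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2 * sigma ** 2))
            for i in range(len(w)):
                w[i] += alpha * h * (x[i] - w[i])
    return weights


def refusal_directions(harmful, harmless, rows=2, cols=2):
    """One unit-norm direction per neuron: neuron minus harmless centroid."""
    mu_harmless = centroid(harmless)
    neurons = train_som(harmful, rows, cols)
    directions = []
    for w in neurons.values():
        d = [wi - mi for wi, mi in zip(w, mu_harmless)]
        norm = math.sqrt(sum(v * v for v in d))
        directions.append([v / norm for v in d])
    return directions


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical stand-ins for harmful / harmless prompt representations.
    harmful = [[rng.gauss(4, 0.5), rng.gauss(4, 0.5)] for _ in range(200)]
    harmless = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(200)]
    for d in refusal_directions(harmful, harmless):
        print(d)
```

With a 1x1 grid the same code yields a single direction from one neuron to the harmless centroid, which is the setting the abstract relates to the difference-in-means baseline.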

Paper Structure

This paper contains 18 sections, 1 theorem, 10 equations, 31 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Let $|\mathcal{I}|=1$, and let $w^{(t)}$ be the neuron obtained by applying the SOM update procedure (eq:som). If $\alpha_t\equiv\alpha$, with $\alpha<\frac{1}{2}$, then the only neuron of the SOM is arbitrarily close to the centroid of the distribution.
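The intuition behind Proposition 1 can be checked numerically. The sketch below assumes that, with a single neuron, the SOM update reduces to the standard rule $w \leftarrow w + \alpha\,(x - w)$; the update equation itself is not reproduced in this summary, so that form, the function name, and the toy 1-D Gaussian data are assumptions.

```python
import random


def one_neuron_som(samples, alpha, w0=0.0):
    """Run the assumed single-neuron update w <- w + alpha * (x - w)."""
    w = w0
    for x in samples:
        w += alpha * (x - w)
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 1-D "representations" drawn around a centroid of 3.0.
    samples = [rng.gauss(3.0, 1.0) for _ in range(20000)]
    sample_mean = sum(samples) / len(samples)
    w = one_neuron_som(samples, alpha=0.01)
    print("centroid:", sample_mean, "neuron:", w)
```

Under this rule the neuron is an exponentially weighted average of the samples it has seen, so with a constant learning rate it fluctuates around the centroid, and the fluctuation shrinks as $\alpha$ decreases.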

Figures (31)

  • Figure 1: Single and multiple directions in the representation space of Llama2-7B. While SD (left) captures a single view of refusal, our MD (right) approach, via a 4x4 SOM, enables a multi-faceted perspective of refusal, and, thus, a higher Attack Success Rate (ASR).
  • Figure 2: 3D PCA of Llama2-7B internals. As we ablate directions, the model represents harmful prompts with reduced variance ($\sigma$), and they approach the harmless distribution (measured by the Euclidean distance between the centroids, $\Delta\mu$).
  • Figure 3: 3D PCA of SOM neurons on harmful prompts' internal representations. Across all models, SOMs organize neurons to span the underlying manifold, covering the entire space. Black lines connect neighboring neurons according to the SOM grid.
  • Figure 4: Cosine similarities across MD directions (and SD) on Llama2-7B (left) and Qwen-14B (right) models. The directions are strongly aligned with each other, indicating that they offer a multi-faceted yet coherent perspective of refusal.
  • Figure 5: An overview of the proposed approach.
  • ...and 26 more figures
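The pairwise comparison summarized in Figure 4 amounts to a cosine-similarity matrix over the extracted directions. A minimal sketch follows; the helper names are hypothetical and the input vectors are placeholders rather than directions from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(directions):
    """Symmetric matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in directions] for u in directions]


if __name__ == "__main__":
    # Placeholder directions; real ones would come from the SOM neurons.
    placeholder = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1], [1.0, 0.0, 0.2]]
    for row in cosine_matrix(placeholder):
        print([round(value, 3) for value in row])
```

Entries close to 1 indicate strongly aligned directions, which is the pattern the caption reports.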

Theorems & Definitions (4)

  • Definition 1: Ablation Operator
  • Proposition 1: Centroid Convergence of 1-Neuron SOM
  • proof
  • proof
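Definition 1 is listed here by name only. As an illustration, the sketch below assumes the directional-ablation form used in prior single-direction work, where the component of an activation $x$ along a unit direction $\hat{r}$ is removed: $x \leftarrow x - (\hat{r}^\top x)\,\hat{r}$. Applying it once per direction, in sequence, is an assumption about how multiple directions could be handled and may differ from the paper's definition.

```python
import math


def unit(v):
    """Rescale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def ablate(x, direction):
    """Remove the component of x along `direction` (assumed operator)."""
    r = unit(direction)
    proj = sum(a * b for a, b in zip(x, r))
    return [a - proj * b for a, b in zip(x, r)]


def ablate_many(x, directions):
    """Ablate each direction in turn (hypothetical multi-direction variant)."""
    for d in directions:
        x = ablate(x, d)
    return x


if __name__ == "__main__":
    activation = [3.0, 4.0, 5.0]
    print(ablate(activation, [0.0, 0.0, 2.0]))
    print(ablate_many(activation, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

When the directions are not mutually orthogonal, sequential ablation only guarantees a zero component along the last direction applied; projecting onto the orthogonal complement of their span would be an alternative design.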