Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee

TL;DR

The paper investigates the mechanistic basis of refusal in instruction-tuned LLMs by combining sparse autoencoders (SAEs) with factorization machines (FMs) to identify causal, jailbreak-critical latent features. It introduces a three-stage pipeline: derive a refusal steering direction, prune the candidate features to a minimal faithful set, and fit an FM to discover higher-order interactions, revealing non-linear dependencies and redundant ('hydra') features. The approach demonstrates that manipulating fine-grained latent features can flip a model from refusing to complying, and that non-linear interactions are needed to capture refusal behavior that linear probes miss. The work advances targeted, feature-level safety interventions and offers insights for auditability, while acknowledging computational costs and limits on generalization across model sizes. In practice, it supports interpretable safety tuning and lays groundwork for more precise containment and auditing of LLM refusal behaviors.

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
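The three stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: the refusal direction is taken as a difference of mean activations (the abstract does not say how the direction is found), "near that direction" is read as cosine similarity against SAE decoder rows, `flips` is a hypothetical callable standing in for "ablate these features, run the model, and check whether it complies", and `fm_score` is the standard second-order factorization machine score rather than the paper's exact feature construction or training procedure.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Stage 1a (assumed method): unit-norm difference of mean activations."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def features_near_direction(decoder, direction, top_k):
    """Stage 1b: indices of the top_k decoder rows most cosine-similar to direction."""
    norms = np.linalg.norm(decoder, axis=1)
    cos = decoder @ direction / (norms * np.linalg.norm(direction))
    return [int(i) for i in np.argsort(-cos)[:top_k]]

def greedy_prune(candidates, flips):
    """Stage 2: drop a feature whenever ablating the rest still flips refusal.

    `flips` is a hypothetical oracle: it takes a list of feature indices and
    returns True if ablating them turns a refusal into compliance.
    """
    kept = list(candidates)
    for f in list(candidates):
        trial = [g for g in kept if g != f]
        if trial and flips(trial):
            kept = trial
    return kept

def fm_score(x, w0, w, V):
    """Stage 3: second-order factorization machine score for feature vector x.

    Pairwise term uses the identity
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2].
    """
    linear = w0 + w @ x
    xv = x @ V                    # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)    # shape (k,)
    return float(linear + 0.5 * np.sum(xv ** 2 - x2v2))

if __name__ == "__main__":
    # Toy data only; none of these numbers come from the paper.
    direction = refusal_direction(np.array([[2.0, 0.0], [4.0, 0.0]]),
                                  np.zeros((2, 2)))
    decoder = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
    candidates = features_near_direction(decoder, direction, top_k=4)
    minimal = greedy_prune(candidates, lambda s: {0, 2} <= set(s))
    print(direction, candidates, minimal)
```

In this toy setup the oracle requires features 0 and 2 to be ablated together, so greedy pruning discards everything else. A real `flips` check would need a model forward pass per trial, which is one reason a greedy, one-feature-at-a-time pass is a natural choice over exhaustive subset search.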

Paper Structure

This paper contains 42 sections, 4 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: The steps involved in causal feature search. Trapezium shapes represent the encoder and decoder of an SAE; the rectangular block between them represents the sparse encoder activations, where ablation is applied.
  • Figure 2: Part of the computation flow in a decoder-only LLM, with an SAE attached to a layer. Square boxes denote SAE encoder activations; orange denotes an ablated feature and green an activated one. Ablating some early-layer features (orange) can activate a set of features (green) in a downstream layer. These downstream features are causal to refusal despite not being active in the first place. The top right shows the jailbroken response after ablating both sets (orange and green); the bottom right shows the safe response when they are not ablated.
  • Figure 3: Harm types based on the Coconot unsafe taxonomy and the count of feature activations for each type. The first four show individual types; the remaining ones on the right show counts of features that fire on multiple harm types.
  • Figure 4: Cosine similarities across layers between residual activations and the steering vector at the last token, for Gemma.
  • Figure 5: Cosine similarities across layers between residual activations and the steering vector at the last token, for LLaMA.
  • ...and 2 more figures