Understanding Refusal in Language Models with Sparse Autoencoders
Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
TL;DR
This work investigates how refusals are encoded in instruction-tuned LLMs by identifying latent refusal features with sparse autoencoders (SAEs) and validating their causal influence on generation. By combining attribution patching and activation steering in SAE feature space, the authors extract a minimal, faithful set of refusal mediators and demonstrate that harm features can act as upstream triggers for refusal, while adversarial jailbreaks suppress these features. The study shows that refusal features are separable from harm features and that refusal features, when used in linear probes, improve generalization to out-of-distribution adversarial samples. The results offer a mechanistic view of refusal, reveal interactions between harm and refusal, and provide practical tools for debugging safety behavior, with code released for reproducibility.
Abstract
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as how upstream and downstream latents relate to one another and how adversarial jailbreaking techniques work. We also show that refusal features help linear probes generalize to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
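
To make the two interventions named above more concrete, here is a minimal NumPy sketch of activation steering and attribution patching in an SAE feature space. It is an illustration under stated assumptions, not the authors' implementation: the SAE weights are random toy values, the ReLU encoder and linear decoder are one common SAE form, and the names `sae_encode`, `sae_decode`, `steer`, and `attribution_scores` are hypothetical. The attribution score uses the usual first-order approximation (change in feature activation times the gradient of a metric with respect to that feature); sign and baseline conventions vary between papers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes, chosen only for illustration

# Hypothetical toy SAE parameters (random; a real SAE would be trained).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)


def sae_encode(x):
    """Map a residual activation to non-negative SAE feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f):
    """Reconstruct the residual activation from SAE feature activations."""
    return f @ W_dec + b_dec


def steer(x, feature_ids, value):
    """Clamp the chosen features to `value` and write the change back to x.

    Only the decoder-space difference is added, so whatever the SAE fails
    to reconstruct (its error term) is left untouched.
    """
    f = sae_encode(x)
    f_new = f.copy()
    f_new[..., feature_ids] = value
    return x + (f_new - f) @ W_dec


def attribution_scores(x_clean, x_corrupt, grad_wrt_x):
    """First-order estimate of each feature's effect on a scalar metric.

    `grad_wrt_x` is the gradient of the metric with respect to the residual
    activation at the clean input; the chain rule through the linear decoder
    gives the gradient with respect to each feature.
    """
    f_clean, f_corrupt = sae_encode(x_clean), sae_encode(x_corrupt)
    grad_wrt_f = W_dec @ grad_wrt_x
    return (f_corrupt - f_clean) * grad_wrt_f


# Usage: ablate two features, then rank features by estimated effect.
x_clean = rng.normal(size=d_model)
x_corrupt = rng.normal(size=d_model)
w_metric = rng.normal(size=d_model)  # toy linear metric m(x) = w_metric @ x

x_steered = steer(x_clean, [3, 7], 0.0)
scores = attribution_scores(x_clean, x_corrupt, w_metric)
top_features = np.argsort(-np.abs(scores))[:5]
```

In a real model the gradient would come from backpropagating a refusal metric (for example, a logit difference) through the network, and the steered activation would be written back into the forward pass with a hook. For the toy linear metric used here the first-order estimate is exact, which makes the sketch easy to check.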
