Refusal in LLMs is an Affine Function
Thomas Marshall, Adam Scherlis, Nora Belrose
TL;DR
The paper addresses how to robustly steer refusal behavior in LLMs by intervening in activation space. It introduces affine concept editing (ACE), which unifies and extends prior activation steering methods. The starting point is the affine decomposition ${\bf v} = {\bf v}_0 + {\rm proj}_{\bf r}^{\perp}(\Delta {\bf v}) + \alpha {\bf r}$, where $\Delta {\bf v} = {\bf v} - {\bf v}_0$ and $\alpha = {\bf r} \cdot \Delta {\bf v} / \|{\bf r}\|^2$. From it the paper derives the standardized update ${\bf v}' = {\bf v} - {\rm proj}_{\bf r}^{\parallel}({\bf v}) + {\rm proj}_{\bf r}^{\parallel}({\bf r}^-) + \alpha {\bf r}$ with ${\bf r} = {\bf r}^+ - {\bf r}^-$, in which $\alpha$ is chosen as a steering coefficient rather than computed from ${\bf v}$. The results show that ACE gives more consistent, standardized control of refusals across multiple models (including Llama 3 70B Instruct) and avoids the nonsense outputs observed with directional ablation on some architectures (e.g., Hermes Eagle RWKV v5). The affine formulation is therefore more general and more reliable for behavior steering, and may apply to a broader set of prompts and behaviors beyond refusal.
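The standardized update above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`proj_parallel`, `ace_update`, `coeff_along`) are hypothetical, activations are represented as small lists of floats, and ${\bf r}^+$ and ${\bf r}^-$ are assumed to be reference activations for prompts that do and do not elicit the behavior.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def proj_parallel(v: Sequence[float], r: Sequence[float]) -> List[float]:
    """Component of v along r: (v . r / ||r||^2) r."""
    scale = dot(v, r) / dot(r, r)
    return [scale * ri for ri in r]


def coeff_along(v: Sequence[float], r: Sequence[float]) -> float:
    """Scalar coefficient of v along r, i.e. v . r / ||r||^2."""
    return dot(v, r) / dot(r, r)


def ace_update(
    v: Sequence[float],
    r_plus: Sequence[float],
    r_minus: Sequence[float],
    alpha: float,
) -> List[float]:
    """v' = v - proj_par(v) + proj_par(r_minus) + alpha * r, with r = r_plus - r_minus.

    r_plus / r_minus are assumed reference activations (hypothetical inputs here).
    """
    r = [p - m for p, m in zip(r_plus, r_minus)]
    v_par = proj_parallel(v, r)          # remove v's own component along r
    base_par = proj_parallel(r_minus, r)  # re-centre on r_minus's component
    return [
        vi - vp + bp + alpha * ri
        for vi, vp, bp, ri in zip(v, v_par, base_par, r)
    ]


# Toy example with made-up 3-dimensional "activations".
r_plus = [3.0, 1.0, 0.0]
r_minus = [1.0, 0.0, 0.0]
r = [p - m for p, m in zip(r_plus, r_minus)]
v = [1.0, 2.0, 4.0]

v_off = ace_update(v, r_plus, r_minus, alpha=0.0)
v_on = ace_update(v, r_plus, r_minus, alpha=1.0)
print(coeff_along(v_off, r), coeff_along(r_minus, r))
print(coeff_along(v_on, r), coeff_along(r_plus, r))
```

In this toy setting the coefficient of the edited vector along ${\bf r}$ depends only on $\alpha$ and the two reference vectors, not on the original ${\bf v}$, while the component of ${\bf v}$ orthogonal to ${\bf r}$ is left untouched.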
Abstract
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at https://github.com/EleutherAI/steering-llama3 .
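One way to read the statement that ACE combines affine subspace projection and activation addition is to group the terms of the update quoted in the TL;DR. The grouping and labels below are an interpretive sketch based on that formula, not notation taken from the paper:

$$
{\bf v}' = \underbrace{{\bf v} - {\rm proj}_{\bf r}^{\parallel}({\bf v}) + {\rm proj}_{\bf r}^{\parallel}({\bf r}^-)}_{\text{affine subspace projection}} + \underbrace{\alpha\, {\bf r}}_{\text{activation addition}}, \qquad {\bf r} = {\bf r}^+ - {\bf r}^-.
$$

Because ${\bf r}$ is parallel to itself, ${\rm proj}_{\bf r}^{\parallel}({\bf r}) = {\bf r}$, and by linearity of the projection

$$
{\rm proj}_{\bf r}^{\parallel}({\bf r}^-) + {\bf r} = {\rm proj}_{\bf r}^{\parallel}({\bf r}^-) + {\rm proj}_{\bf r}^{\parallel}({\bf r}^+ - {\bf r}^-) = {\rm proj}_{\bf r}^{\parallel}({\bf r}^+).
$$

Under this reading, $\alpha = 0$ sets the component of ${\bf v}'$ along ${\bf r}$ to that of ${\bf r}^-$, and $\alpha = 1$ sets it to that of ${\bf r}^+$, which is what makes the scale of $\alpha$ comparable across models.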
