
Surgical Activation Steering via Generative Causal Mediation

Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell

TL;DR

Generative Causal Mediation (GCM) is a procedure for selecting model components (e.g., attention heads) to steer a binary concept, using contrastive long-form responses. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads.

Abstract

Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
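
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the indirect effect of a head is the change in the log-likelihood margin between the contrasting and the original response when that head is patched, and the names `indirect_effect` and `select_top_heads`, as well as all numbers, are hypothetical.

```python
import math

def indirect_effect(logp_patched, logp_clean):
    """Assumed scoring rule: how much patching one head shifts the margin
    log p(contrasting response) - log p(original response).
    Each argument is a pair (logp_contrasting, logp_original)."""
    patched_margin = logp_patched[0] - logp_patched[1]
    clean_margin = logp_clean[0] - logp_clean[1]
    return patched_margin - clean_margin

def select_top_heads(effects, k):
    """Rank heads by indirect effect and keep the top fraction k (0 < k <= 1).
    `effects` maps (layer, head) -> indirect effect."""
    n_keep = max(1, math.ceil(k * len(effects)))
    ranked = sorted(effects, key=effects.get, reverse=True)
    return ranked[:n_keep]

# Made-up log-likelihoods for one prompt pair, for illustration only.
clean = (-42.0, -30.0)  # unpatched run on the original input
patched = {
    (0, 0): (-41.5, -30.2),
    (0, 1): (-36.0, -33.0),
    (1, 0): (-42.1, -29.9),
    (1, 1): (-39.0, -31.0),
}

effects = {head: indirect_effect(lp, clean) for head, lp in patched.items()}
top_heads = select_top_heads(effects, k=0.5)
print(top_heads)
```

In practice the effects would be averaged over the whole contrastive dataset before ranking; the single-example version above only shows the shape of the computation.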

Paper Structure (43 sections, 10 equations, 29 figures, 9 tables)


Figures (29)

  • Figure 1: A schematic overview of Generative Causal Mediation Analysis (GCM) for steering toward the verse style-transfer concept, which is operationalized as a dataset of paired original and contrasting inputs along with their corresponding responses. The LM is run on the original input ("Talk in prose. What is time?") while an individual attention head is patched to take on the value it would have under the contrasting input ("Talk in verse. What is time?"). We then measure the indirect effect of the patched attention head on increasing the likelihood of the contrasting response ("River without end, time flows silent through") relative to the original response ("Time is the unstoppable flow of events"). Individual attention heads are ranked by the strength of this indirect effect. The top k% of ranked attention heads are then patched, all at once, to steer the model.
  • Figure 2: A comparison of steering success rates across localization methods (columns) that identify where to apply the difference-in-means steering vector on the Qwen-14B model. The x-axis of each heatmap is the fraction of steered attention heads, $k$, and the y-axis is the scaling factor, $\alpha$, for the steering vector. Each cell contains the steering success rate. On average, GCM variants achieve a higher steering success rate (see Table \ref{tab:avg-all}). Similar plots for the OLMo-13B and SOLAR-10.7B models are provided in Appendix \ref{app:where-to-steer}.
  • Figure 3: Steering success rate on held-out datasets is model and task dependent.
  • Figure 4: Causal abstractions for our three tasks. Each abstraction is represented by a univariate acyclic graph that abstracts the model’s processing mechanism.
  • Figure 5: A comparison of steering success rates when using difference-in-means steering and the localization methods from § \ref{sec:where-to-steer} on the Qwen-14B model.
  • ...and 24 more figures
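
The captions for Figures 2 and 5 refer to a difference-in-means steering vector scaled by a factor $\alpha$. A rough sketch of that idea is given below. It assumes the vector is the mean head activation on contrasting inputs minus the mean on original inputs, and that steering adds the scaled vector to the head's output; the function names and the toy activations are hypothetical and not taken from the paper.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def diff_in_means(contrast_acts, original_acts):
    """Steering direction: mean contrasting activation minus mean original activation."""
    mu_contrast = mean_vector(contrast_acts)
    mu_original = mean_vector(original_acts)
    return [c - o for c, o in zip(mu_contrast, mu_original)]

def steer(activation, steering_vector, alpha):
    """Assumed additive steering: shift one head's output by alpha * steering_vector."""
    return [a + alpha * s for a, s in zip(activation, steering_vector)]

# Toy 2-dimensional activations for a single selected head.
original_acts = [[0.0, 1.0], [2.0, 1.0]]
contrast_acts = [[3.0, 0.0], [5.0, 2.0]]

steering_vector = diff_in_means(contrast_acts, original_acts)
steered = steer([1.0, 1.0], steering_vector, alpha=2.0)
print(steering_vector, steered)
```

Under this reading, the localization method decides which heads receive the shift (the fraction $k$), while $\alpha$ controls how strongly each selected head is pushed.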