Analyzing the Generalization and Reliability of Steering Vectors

Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk

TL;DR

This paper rigorously evaluates steering vectors (SVs) as inference-time interventions that modify intermediate activations to steer language model behaviour. Using a broad suite of Model-Written Evaluations (MWE) datasets and controlled shifts in the prompt distribution, it quantifies in-distribution steerability and out-of-distribution generalisation. The key findings are substantial per-sample variability, anti-steerable cases in which steering produces the opposite behaviour, and a prominent steerability bias tied to answer tokens and positions, with generalisation depending largely on dataset properties and model propensity. The results indicate that while SVs can work in some settings, they are not a universal, scalable solution: further work is needed to understand and mitigate these biases and to improve cross-prompt generalisation.

Abstract

Steering vectors (SVs) have been proposed as an effective approach to adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain technical difficulties of applying steering vectors to guide models' behaviour at scale. Our code is available at https://github.com/dtch1997/steering-bench
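To make "intervening on intermediate model activations" concrete, the following is a minimal sketch of one common construction. It assumes the steering vector is the difference between mean activations on positive and negative examples, added to an activation at inference time with a scalar multiplier. The abstract above does not specify the extraction method, so this construction, the function names, and the toy activations are illustrative assumptions rather than the authors' implementation.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Assumed construction: mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation, vector, multiplier):
    """Add the scaled steering vector to one intermediate activation."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy 3-dimensional activations (hypothetical values, not taken from any model).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

sv = steering_vector(pos_acts, neg_acts)
steered = apply_steering([0.5, 0.5, 0.5], sv, multiplier=1.5)
print(sv, steered)
```

In a real model the addition would be applied at a chosen layer during the forward pass; a negative multiplier steers away from the behaviour instead of towards it.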

Paper Structure (47 sections, 3 equations, 22 figures, 3 tables)

This paper contains 47 sections, 3 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Steering effects are not reliable, and often steer in the opposite direction. We show per-sample steerability and the fraction of anti-steerable examples for a representative sample of 13 datasets (out of 40 total). Many datasets have a high variation in per-sample steerability, and several datasets produce the opposite behaviour for almost 50% of inputs. For all datasets see \ref{fig:per_sample_steerability_anti_all40}. Some dataset names have been shortened.
  • Figure 2: Example propensity curve and steerability fit for high steerability (left), and low (right).
  • Figure 3: Models exhibit large dataset-dependent steerability bias. The figure shows mean steerability per dataset for each way in which the positive option is presented. An entirely unbiased result would have all bars identical. Despite datasets being balanced amongst all possible combinations of options, the mean steerability differs greatly between these splits. While there is a general trend towards preferring 'Yes' over 'No', there is still a lot of dataset-dependent variation, and there is no clear trend for 'A' vs 'B'. For full results see \ref{fig:plot_slope_and_counts_for_response_is_Yes_all40}. Note that some datasets have only two bars, indicating that only the 'A'/'B' split is relevant.
  • Figure 4: SVs exhibit high variance, some of which is explained by spurious factors. The figure shows variance in per-sample steerability by dataset, with attributions to known spurious factors annotated. Marginal Var Explained refers to the variance explained by the 'Yes'/'No' split after removing variance from the 'A'/'B' split. For some datasets, spurious factors (orange, green) explain a large percentage of the variance, while for others, most of the variance remains unexplained. For full results see \ref{fig:explained_variance_steerability_all40}.
  • Figure 5: In-distribution and out-of-distribution steerability are reasonably well-correlated. We show OOD vs ID steerability for Llama-2-7b (left; $\rho = 0.891$) and Qwen-1.5-14b (right; $\rho = 0.694$). While OOD steerability seems correlated with ID steerability, we observe that there are some points far above or below the $x = y$ line, and this is more noticeable for the Qwen model. Throughout, $\rho$ refers to Spearman's rank correlation coefficient.
  • ...and 17 more figures
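
The quantities named in the captions above can be sketched in a few lines. This is an illustration under stated assumptions: per-sample steerability is taken to be the slope of a least-squares line through a sample's propensity curve (propensity versus steering multiplier), an anti-steerable sample is one with a negative slope, and Spearman's $\rho$ is computed as the Pearson correlation of average ranks. The slope and anti-steerable definitions are inferred from the captions, and all function names and numbers are hypothetical toy values, not results from the paper.

```python
import math


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs (assumed steerability definition)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy propensity curves: one list of propensities per sample, one value per multiplier.
multipliers = [-1.0, 0.0, 1.0]
curves = [[-2.0, 0.0, 2.0], [1.0, 0.0, -1.0], [0.0, 0.5, 1.0], [0.5, 0.0, -0.5]]
slopes = [fit_slope(multipliers, curve) for curve in curves]
anti_fraction = sum(s < 0 for s in slopes) / len(slopes)

# Toy per-dataset steerabilities, in-distribution versus out-of-distribution.
rho = spearman([0.1, 0.5, 0.9, 1.3], [0.2, 0.4, 1.5, 1.0])
print(slopes, anti_fraction, rho)
```

A rank correlation is insensitive to the scale of steerability, so a high $\rho$ can coexist with individual datasets lying far from the $x = y$ line, as the Figure 5 caption notes.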