Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

Joschka Braun

TL;DR

This thesis investigates why steering reliability differs across behaviors and how it is affected by the steering vector training data. The findings suggest that steering vectors are unreliable when the latent target behavior representation is not well approximated by the linear steering direction.

Abstract

Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.

Paper Structure (134 sections, 23 equations, 34 figures, 1 table)

Figures (34)

  • Figure 1: A hypothetical representation space illustrating how features such as gender and royalty can be represented as linear directions. The representation of "king", for instance, can be decomposed into its components along the "male" and "royal" directions within this subspace. This linear structure enables vector arithmetic such as analogies (e.g., "king" - "man" + "woman" $\approx$ "queen") through vector addition and scalar multiplication. Such vector arithmetic on linearly encoded features was first demonstrated by Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space", their work on Word2Vec.
  • Figure 2: Linear scaling of sentiment and object size along their respective feature directions.
  • Figure 3: Because of feature entanglement in (a), changing the gender also changes the professional position. For disentangled features, gender and professional positions can be varied independently.
  • Figure 4: Illustrating CAA steering vector computation in a favorable scenario: A 2D projection at layer $l$ shows distinct positive and negative activation clusters. This allows the steering vector $\mathbf{s}^l$ to accurately approximate individual activation differences. The means of positive and negative activations are $\boldsymbol{\mu}^{l,+} = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum \mathbf{a}^l(x_i, y^+_i)$ and $\boldsymbol{\mu}^{l,-} = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum \mathbf{a}^l(x_i, y^-_i)$, respectively. The steering vector can be computed as the difference between these means $\mathbf{s}^l = \boldsymbol{\mu}^{l,+} - \boldsymbol{\mu}^{l,-}$, which is equivalent to the previous definition as the mean of paired activation differences.
  • Figure 5: The mean cosine similarity between the activation differences and the resulting steering vector is a concise metric for how well, on average, the steering vector aligns directionally with each individual activation difference. This directional agreement metric captures whether the activation differences point in the same direction and are therefore well approximated by a common steering vector. While the computation of the steering vector itself is insensitive to how activations are paired, the mean cosine similarity metric remains sensitive to those pairings.
  • ...and 29 more figures
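
The captions of Figures 4 and 5 describe two computations that a small example may make concrete: the steering vector as the difference of the positive and negative activation means, and the mean cosine similarity between the paired activation differences and that vector. The following is a minimal sketch in plain Python, not the thesis's implementation. The function names (`steering_vector`, `mean_cosine_similarity`) and the 2D toy activations are assumptions made for illustration; real activations would come from a model layer and have far more dimensions.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steering_vector(pos, neg):
    """Difference of means: mean(pos) - mean(neg)."""
    return subtract(mean_vector(pos), mean_vector(neg))


def mean_cosine_similarity(pos, neg):
    """Average cosine similarity between each paired difference
    pos[i] - neg[i] and the steering vector."""
    s = steering_vector(pos, neg)
    diffs = [subtract(p, n) for p, n in zip(pos, neg)]
    return sum(cosine(d, s) for d in diffs) / len(diffs)


# Hypothetical 2D activations (toy values, not real model data).
pos = [[2.0, 1.0], [3.0, -1.0], [2.5, 0.5]]
neg = [[0.0, 1.0], [1.0, -1.0], [0.5, 0.5]]

# Same negative activations, but paired with different positives.
neg_repaired = [neg[1], neg[2], neg[0]]

s_original = steering_vector(pos, neg)
s_repaired = steering_vector(pos, neg_repaired)

# The mean of paired differences equals the difference of means.
mean_of_diffs = mean_vector([subtract(p, n) for p, n in zip(pos, neg)])

print("steering vector:", s_original)
print("steering vector after re-pairing:", s_repaired)
print("mean cosine similarity:", mean_cosine_similarity(pos, neg))
print("mean cosine similarity after re-pairing:",
      mean_cosine_similarity(pos, neg_repaired))
```

In this toy setup, re-pairing the negatives leaves the steering vector unchanged, because the two means do not depend on the pairing. The mean cosine similarity drops, because the individual differences no longer all point along the steering direction. This matches the pairing sensitivity noted in the Figure 5 caption.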