Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations

Joschka Braun

TL;DR

This thesis investigates why steering reliability differs across behaviors and how it is affected by the steering vector training data. The findings suggest that steering vectors are unreliable when the linear steering direction does not effectively approximate the latent representation of the target behavior.

Abstract

Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.

Paper Structure (134 sections, 23 equations, 34 figures, 1 table)

Introduction
The rise of foundation models and the need for post-training adaptation
Post-training adaptations to foundation models
Activation Engineering
Steering vectors
Limitations of steering vectors
Research questions
Research scope
Thesis contributions
Background
Learned word embeddings
Dense word embedding methods
Word2Vec
Global Vectors for Word Representation (GloVe)
FastText
...and 119 more sections

Figures (34)

Figure 1: A hypothetical representation space illustrating how features like gender and royalty can be represented as linear directions. The representation of "king", for instance, can be decomposed into its components along the "male" and "royal" directions within this subspace. This linear structure enables vector arithmetic such as analogies (e.g., "king" - "man" + "woman" $\approx$ "queen") through vector addition and scalar multiplication. Such arithmetic on representation vectors with linearly encoded features was first demonstrated by Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space", their work on Word2Vec.
Figure 2: Linear scaling of sentiment and object size along their respective feature directions.
Figure 3: Because of feature entanglement in (a), changing the gender also changes the professional position. For disentangled features, gender and professional positions can be varied independently.
Figure 4: Illustrating CAA steering vector computation in a favorable scenario: A 2D projection at layer $l$ shows distinct positive and negative activation clusters. This allows the steering vector $\mathbf{s}^l$ to accurately approximate individual activation differences. The means of positive and negative activations are $\boldsymbol{\mu}^{l,+} = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum \mathbf{a}^l(x_i, y^+_i)$ and $\boldsymbol{\mu}^{l,-} = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum \mathbf{a}^l(x_i, y^-_i)$, respectively. The steering vector can be computed as the difference between these means $\mathbf{s}^l = \boldsymbol{\mu}^{l,+} - \boldsymbol{\mu}^{l,-}$, which is equivalent to the previous definition as the mean of paired activation differences.
Figure 5: The mean cosine similarity between the activation differences and the resulting steering vector is a concise metric for how well, on average, the steering vector aligns directionally with each individual activation difference. This directional agreement metric captures whether the activation differences point in the same direction and are well approximated in direction by a common steering vector. While the computation of the steering vector itself is insensitive to how activations are paired, the mean cosine similarity metric remains sensitive to those pairings.
...and 29 more figures
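
The quantities described in the captions of Figures 4 and 5 can be illustrated with a minimal NumPy sketch. It is not the thesis code: the function names `steering_vector` and `mean_cosine_similarity`, the synthetic Gaussian activations, and the chosen dimensions are all assumptions made for illustration. The sketch computes the steering vector as the difference of means, checks that it equals the mean of paired activation differences, and shows that shuffling the pairing leaves the steering vector unchanged while lowering the mean cosine similarity.

```python
import numpy as np


def steering_vector(pos, neg):
    """Difference of means s = mu_plus - mu_minus for arrays of shape (n, d)."""
    return pos.mean(axis=0) - neg.mean(axis=0)


def mean_cosine_similarity(pos, neg):
    """Mean cosine similarity between paired differences and their mean."""
    diffs = pos - neg                      # one difference per (x_i, y_i^+, y_i^-) pair
    s = diffs.mean(axis=0)                 # steering vector as mean of differences
    cos = diffs @ s / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(s))
    return float(cos.mean())


# Synthetic stand-in for layer activations (an assumption, not real model data).
rng = np.random.default_rng(0)
n, d = 200, 16
direction = np.zeros(d)
direction[0] = 4.0                         # hypothetical "behavior" direction
neg = rng.normal(size=(n, d))
pos = neg + direction + 0.5 * rng.normal(size=(n, d))

s = steering_vector(pos, neg)
# Difference of means equals the mean of paired differences.
assert np.allclose(s, (pos - neg).mean(axis=0))

# Breaking the pairing does not change the steering vector ...
shuffled_pos = pos[rng.permutation(n)]
assert np.allclose(steering_vector(shuffled_pos, neg), s)

# ... but it does change the directional agreement metric.
paired = mean_cosine_similarity(pos, neg)
unpaired = mean_cosine_similarity(shuffled_pos, neg)
print(f"paired: {paired:.3f}, unpaired: {unpaired:.3f}")
```

In this toy setup the paired differences share most of their direction, so the metric is high. After shuffling, each difference also contains the gap between two unrelated negative activations, which adds noise and lowers the metric even though the steering vector is identical.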
