Interpretable Steering of Large Language Models with Feature Guided Activation Additions

Samuel Soo; Chen Guang; Wesley Teng; Chandrasekaran Balaganesh; Tan Guoxian; Yan Ming

TL;DR

FGAA addresses the challenge of controlled text generation by introducing a latent-space activation steering method that leverages Sparse Autoencoder features and optimization. It advances activation steering by operating in the SAE latent space, using contrastive analysis, feature filtering, and a linear effect approximator to build precise steering vectors that improve both steering effectiveness and output coherence. Empirical results on Gemma-2-2B and Gemma-2-9B show FGAA often outperforms CAA, SAE decoder steering, and SAE-TS, though performance varies with model size, and a universal trade-off between steering scale and preserving general capabilities is observed. This work contributes to interpretable, reliable steering of LLMs and suggests directions for applying FGAA to safety-critical tasks and further SAE-based methods.

Abstract

Effective and reliable control over large language model (LLM) behavior remains a significant challenge. Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that builds on insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that produce stronger steering effects while maintaining the coherence of steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B across various steering tasks show that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
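To make the two ideas named in the abstract concrete (contrasting activations in SAE latent space, then adding a scaled vector to hidden states), the following is a minimal NumPy sketch on toy data. It is not the paper's implementation: the function names, the ReLU encoder form, and the matrices `W_enc`, `b_enc`, and `W_dec` are hypothetical, and decoding the latent difference with `W_dec` is a simplification standing in for FGAA's feature filtering and linear effect approximator, which are not reproduced here.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def contrastive_feature_diff(pos_acts, neg_acts, W_enc, b_enc):
    """Mean SAE feature activation on positive examples minus the mean on negative ones."""
    f_pos = sae_encode(pos_acts, W_enc, b_enc).mean(axis=0)
    f_neg = sae_encode(neg_acts, W_enc, b_enc).mean(axis=0)
    return f_pos - f_neg

def steering_vector_from_features(v_diff, W_dec):
    """Stand-in step (assumption): map the latent difference back to model space via the decoder."""
    return v_diff @ W_dec

def apply_steering(hidden, vec, scale):
    """Add the scaled steering vector to every position's hidden state."""
    return hidden + scale * vec

# Toy setup: 3-dimensional hidden states and 3 SAE features with identity weights.
W_enc = np.eye(3)
b_enc = np.zeros(3)
W_dec = np.eye(3)

pos_acts = np.array([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])  # activations on "desired" examples
neg_acts = np.array([[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])  # activations on "undesired" examples

v_diff = contrastive_feature_diff(pos_acts, neg_acts, W_enc, b_enc)
vec = steering_vector_from_features(v_diff, W_dec)

hidden = np.zeros((2, 3))  # two token positions
steered = apply_steering(hidden, vec, scale=2.0)
```

The `scale` argument plays the role of the steering scale discussed in the abstract: larger values push the hidden states further along the steering direction, which is where the trade-off with general capabilities would arise.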

Paper Structure (33 sections, 4 equations, 11 figures, 6 tables)


Introduction
Related Work
Mechanistic Interpretability and SAEs
Linear Representation Hypothesis
Activation Steering
Feature Guided Activation Additions
SAE-Based Contrastive Analysis
Feature Filtering
Linear Approximator Optimization
Final Steering Application
Evaluations and Discussion
Effectiveness of FGAA for Steering
Advantages over Existing Methods
Effects of Steering on General Model Capabilities
Limitations
...and 18 more sections

Figures (11)

Figure 1: Diagram showing the process for computing $\mathbf{v}_\text{diff}$ on a simplified "Anger" task.
Figure 2: Plots showing mean BCS with 95% confidence intervals for the CAA, SAE, SAE-TS and FGAA steering methods on 9 tasks, for Gemma-2-2B.
Figure 3: Plots showing mean BCS with 95% confidence intervals for the CAA, SAE, SAE-TS and FGAA steering methods on 9 tasks, for Gemma-2-9B.
Figure 4: Relative perplexity vs steering scale (0-300). Lower values indicate better preserved language modeling. Results averaged across steering vectors from 9 different tasks, evaluated on the first 100 records in OpenWebText.
Figure 5: Benchmark performance vs steering scale (0-200). Higher values indicate better capability preservation. Results averaged across steering vectors from 3 tasks (Anger, Christian Evangelist and Conspiracy).
...and 6 more figures
