Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming
TL;DR
FGAA addresses controlled text generation with an activation steering method that operates in the latent space of a Sparse Autoencoder (SAE) and uses optimization to select features. It combines contrastive analysis, feature filtering, and a linear effect approximator to build precise steering vectors that improve both steering effectiveness and output coherence. Empirical results on Gemma-2-2B and Gemma-2-9B show that FGAA often outperforms CAA, SAE decoder steering, and SAE-TS, though performance varies with model size. All tested methods show a trade-off between steering scale and preservation of general capabilities. The work contributes to interpretable, reliable steering of LLMs and suggests directions for applying FGAA to safety-critical tasks and to further SAE-based methods.
Abstract
Effective and reliable control over large language model (LLM) behavior remains a significant challenge. Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that builds on insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining the coherence of steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B across various steering tasks demonstrate that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
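As a rough illustration of the general idea of activation steering described in the abstract, the sketch below builds a contrastive steering vector as a difference of mean activations and adds a scaled copy of it to a block of hidden states. It is a toy NumPy example on synthetic data, not the FGAA procedure itself: the function names, the dimension `d_model`, the planted direction, and the scale value are all assumptions made for illustration, and no SAE encoding, feature filtering, or effect approximator is involved.

```python
import numpy as np


def contrastive_steering_vector(pos_acts, neg_acts):
    """Difference of mean activations over positive and negative examples.

    Both inputs have shape (n_examples, d_model). This is the simple
    mean-difference construction, used here only as an illustration.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def add_steering(hidden, vector, scale):
    """Add a scaled steering vector to the hidden state at every token position.

    hidden has shape (n_tokens, d_model); vector has shape (d_model,).
    """
    return hidden + scale * vector


# Synthetic stand-ins for activations recorded from a model (assumption).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0  # hypothetical "behavior" direction planted in the data

pos_acts = rng.normal(size=(32, d_model)) + 2.0 * direction
neg_acts = rng.normal(size=(32, d_model)) - 2.0 * direction

vector = contrastive_steering_vector(pos_acts, neg_acts)

hidden = rng.normal(size=(5, d_model))  # hidden states for 5 token positions
steered = add_steering(hidden, vector, scale=4.0)
```

The `scale` argument corresponds to the steering scale mentioned above: larger values push the hidden states further along the steering direction, which is where the reported trade-off with general model capabilities would be expected to appear.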
