Steering Large Language Model Activations in Sparse Spaces

Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent

TL;DR

This work introduces Sparse Activation Steering (SAS), a framework that steers large language models by operating in sparse activation spaces learned via Sparse Autoencoders (SAEs). SAS uses contrastive prompt-pairing to identify behavior-specific sparse features and forms steering vectors that reinforce desired behaviors while suppressing opposing tendencies, applied during inference without weight updates. Scaling the SAE dictionary improves monosemanticity and enables compositional steering of multiple behaviors with minimal or even positive effects on standard benchmarks and targeted tasks like TruthfulQA. The approach offers flexible, context-aware control, presenting a practical path toward fine-grained alignment with robust interpretability and modularity.

Abstract

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.

Paper Structure

This paper contains 61 sections, 24 equations, 31 figures, 2 tables, 2 algorithms.

Figures (31)

  • Figure 1: Sparse Activation Steering (SAS) Vector Generation. The process of generating SAS vectors consists of six steps: (1) Construct a contrastive pair of prompts, where one completion exhibits the desired behavior (positive) and the other its opposite (negative). (2) Extract sparse representations of activations from a selected model layer using a Sparse Autoencoder (SAE) encoder $\boldsymbol{f}(\boldsymbol{a})$. (3) Filter out inactive features using an activation frequency threshold $\tau$. (4) Remove shared features between the positive and negative representations to isolate behavior-specific components. (5) Compute mean activation vectors from the sparse matrices of positive and negative completions. (6) Construct the final SAS vector by subtracting the negative mean vector from the positive mean vector. The resulting vector reinforces the intended behavior through its "positive components" while suppressing the model’s existing tendencies that contradict the target behavior through its "negative components" during inference. See the algorithm in \ref{app:algorithm_gen}.
  • Figure 2: Applying SAS vectors during inference. Given an input prompt, the activations from a specific layer $\ell$ are first encoded into a sparse representation using a Sparse Autoencoder (SAE) encoder ($\boldsymbol{f}(\boldsymbol{a}) = \sigma(W_{\text{enc}} \boldsymbol{a} + \boldsymbol{b}_{\text{enc}})$). The SAS vector, scaled by the parameter $\lambda$, is then added to the sparse representation to adjust the model’s behavior: positive components reinforce the target behavior, while negative components suppress model tendencies that contradict it. The modified sparse representation is processed through the SAE non-linearity $\sigma$ once more to ensure consistency with the learned sparse distribution before being decoded back into the dense activation space. See the algorithm in \ref{app:algorithm_inf}.
  • Figure 3: Impact of $\tau$ on Behavior Steering. Effect of varying $\tau$, which controls the sparsity of SAS vectors, on behavior modulation. Lower values of $\tau$ (e.g., $0.7$) retain more active features, reducing reconstruction loss and leading to stronger behavior shifts. Higher values of $\tau$ (e.g., $0.9$) enforce greater sparsity while preserving key features necessary for effective steering. Experiments were conducted on Gemma-2 2B with $\lambda = \pm1$ and an SAE with a dictionary size of 65K.
  • Figure 4: Impact of $\lambda$ on Behavior Steering. Effect of increasing $\lambda$, which determines the strength of SAS vectors during inference. As $\lambda$ increases from $\pm1$ to $\pm2$, the steering effect intensifies, leading to more significant shifts in behavior alignment. Positive steering ($\lambda > 0$) reinforces the target behavior, while negative steering ($\lambda < 0$) suppresses it. Experiments were conducted on Gemma-2 2B using an SAE with a dictionary size of $65K$ and $\tau = 0.7$.
  • Figure 5: Open-Ended Generation Evaluation. Normalized behavioral scores (relative to $\lambda = 0$) for all behaviors as a function of the steering parameter $\lambda$. (Left) Standard open-ended evaluation where the model generates responses without answer choices or the answer prefix. (Middle) Evaluation with the prefix "The answer is:" added to guide the model toward directly answering the question. (Right) Evaluation where answer choices are provided to the model alongside the prefix, and an LLM is used as a judge for open-ended responses. Higher $\lambda$ values generally increase adherence to the target behaviors. Experiments were conducted on the Gemma-2 2B model using an SAE with a dictionary size of $65K$, $\tau = 0.7$, and $\lambda = \pm1$ at layer $14$. Additional details and results for other layers can be found in \ref{app:open_gen_all}.
  • ...and 26 more figures
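
The steps described in the captions of Figures 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the authors' implementation: $\sigma$ is taken to be a plain ReLU, $\tau$ is read as the minimum fraction of samples in which a feature must be active, a feature counts as "shared" when it survives the $\tau$ filter on both sides, and the decoder is assumed to be the usual affine map $W_{\text{dec}} \boldsymbol{z} + \boldsymbol{b}_{\text{dec}}$. The function names (`sae_encode`, `sas_vector`, `steer`) and the toy matrices are hypothetical.

```python
import numpy as np


def relu(x):
    # Stand-in for the SAE non-linearity sigma (assumption: plain ReLU).
    return np.maximum(x, 0.0)


def sae_encode(a, W_enc, b_enc):
    # f(a) = sigma(W_enc a + b_enc); W_enc has shape (n_features, d_model).
    return relu(a @ W_enc.T + b_enc)


def sas_vector(pos_sparse, neg_sparse, tau):
    """Build a steering vector from sparse codes of shape (n_samples, n_features)."""
    # Step 3: keep features that are active in at least a fraction tau of samples
    # (assumed reading of the "activation frequency threshold").
    pos_keep = (pos_sparse > 0).mean(axis=0) >= tau
    neg_keep = (neg_sparse > 0).mean(axis=0) >= tau
    # Step 4: drop features that survive on both sides (assumed meaning of "shared").
    shared = pos_keep & neg_keep
    pos_keep = pos_keep & ~shared
    neg_keep = neg_keep & ~shared
    # Step 5: mean activation per feature, restricted to the kept features.
    pos_mean = pos_sparse.mean(axis=0) * pos_keep
    neg_mean = neg_sparse.mean(axis=0) * neg_keep
    # Step 6: positive mean minus negative mean.
    return pos_mean - neg_mean


def steer(a, v, lam, W_enc, b_enc, W_dec, b_dec):
    # Encode, add the scaled SAS vector, re-apply sigma, then decode.
    z = sae_encode(a, W_enc, b_enc)
    z_steered = relu(z + lam * v)
    # Assumed affine decoder; W_dec has shape (d_model, n_features).
    return z_steered @ W_dec.T + b_dec


# Toy sparse codes for two positive and two negative completions (made-up numbers).
pos = np.array([[1.0, 2.0, 0.0, 0.0],
                [3.0, 2.0, 0.0, 1.0]])
neg = np.array([[0.0, 1.0, 4.0, 0.0],
                [0.0, 3.0, 2.0, 0.0]])
v = sas_vector(pos, neg, tau=0.7)

# Identity encoder/decoder so the effect on the sparse code is easy to read off.
eye, zeros = np.eye(4), np.zeros(4)
a = np.ones(4)
steered_pos = steer(a, v, lam=1.0, W_enc=eye, b_enc=zeros, W_dec=eye, b_dec=zeros)
steered_neg = steer(a, v, lam=-1.0, W_enc=eye, b_enc=zeros, W_dec=eye, b_dec=zeros)
print(v, steered_pos, steered_neg)
```

In this toy, feature 1 is frequent in both sets and is removed as shared, feature 3 fails the $\tau$ filter, feature 0 becomes a positive component, and feature 2 becomes a negative component. With $\lambda = 1$ the positive component is amplified and the negative one is clipped to zero by the second application of $\sigma$; with $\lambda = -1$ the roles swap.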