SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

TL;DR

This work addresses reliable steering of open-ended generation in large language models by constraining interventions to a sparse, task-relevant latent subspace learned with sparse autoencoders (SAEs). It introduces SAE-SSV, which uses linear probes to identify a compact subspace of latent dimensions and then learns a supervised steering vector within it, using a composite loss to balance behavioral alignment against generation quality. Across sentiment, truthfulness, and political polarity steering, SAE-SSV achieves higher steering success than baselines with minimal degradation in generation quality, and analysis shows that a small subspace suffices for effective, interpretable control. The approach, motivated by representation disentanglement and supervision, offers targeted, efficient interventions and may extend to other tasks, with universal steering vectors across tasks and models left as future work.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions. Our implementation is publicly available at https://github.com/Ineedanamehere/SAE-SSV.
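
The three steps in the abstract (sparse latents, probe-based dimension selection, a steering vector confined to the selected subspace) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the latents are synthetic rather than real SAE outputs, the function names (`train_probe`, `select_dims`, `learn_steering_vector`, `make_toy_latents`) are hypothetical, and the objective is a simple stand-in (probe log-loss plus an L2 penalty), not the composite loss used in the paper.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def train_probe(z, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe on latents z (n, d), labels y in {0, 1}."""
    n, d = z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(z @ w + b) - y          # gradient of the log-loss w.r.t. logits
        w -= lr * (z.T @ err) / n
        b -= lr * err.mean()
    return w, b

def select_dims(w, k):
    """Indices of the k latent dimensions with the largest |probe weight|."""
    return np.argsort(-np.abs(w))[:k]

def learn_steering_vector(z_src, w, b, dims, lam=0.1, lr=0.5, steps=200):
    """Learn v, nonzero only on dims, that pushes z_src + v toward class 1.

    Stand-in objective (an assumption, not the paper's loss):
        -mean(log sigmoid((z + v) @ w + b)) + (lam / 2) * ||v||^2
    """
    mask = np.zeros(z_src.shape[1])
    mask[dims] = 1.0
    v = np.zeros_like(mask)
    for _ in range(steps):
        p = sigmoid((z_src + v) @ w + b)
        grad = -(1.0 - p).mean() * w + lam * v
        v -= lr * grad * mask                 # updates stay inside the subspace
    return v

def make_toy_latents(rng, n=400, d=64, informative=(3, 17, 42)):
    """Synthetic sparse latents where only `informative` dims depend on the label."""
    y = rng.integers(0, 2, size=n)
    z = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.2)
    z[:, list(informative)] += 2.0 * y[:, None]
    return z, y

rng = np.random.default_rng(0)
z, y = make_toy_latents(rng)
w, b = train_probe(z, y)
dims = select_dims(w, k=3)
v = learn_steering_vector(z[y == 0], w, b, dims)
before = sigmoid(z[y == 0] @ w + b).mean()
after = sigmoid((z[y == 0] + v) @ w + b).mean()
print(sorted(dims.tolist()), round(before, 3), round(after, 3))
```

The mask is the point of the sketch: gradient updates are zeroed outside the probed dimensions, so the learned vector can only move latents along the few directions the probe found informative. In a real setting the steered latent would still have to be decoded back into the model's activation space, which this toy example does not attempt.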

Paper Structure

This paper contains 23 sections, 8 equations, 24 figures, and 12 tables.

Figures (24)

  • Figure 1: Overview of the SAE-SSV framework. It encodes model activations into a sparse latent space, selects task-relevant dimensions via linear probes, and optimizes steering vectors with combined losses to ensure effective control while maintaining generation quality.
  • Figure 2: Activation heatmaps of the top-30 dimensions for each task. (a) Sentiment task. (b) Truthfulness task. Each panel compares class-wise activation patterns in the raw residual space and SAE space.
  • Figure 3: (a) shows how the number of linear classifiers affects feature selection stability. (b) shows that a small number of top SAE dimensions enable clear class separation.
  • Figure 4: Average projection values of token activations along four directions: no steering (gray), SAE-SSV (blue), orthogonal (green), and random (orange). Computed over successfully steered samples. SAE-SSV induces a consistent and sustained directional shift, while other directions show minimal change.
  • Figure 5: Case study on the sentiment steering task. The input prompts are negative movie reviews. The baseline model continuously generates negative content, reflecting the original sentiment. Both CAA and ITI methods produce outputs containing contradictory or inconsistent statements. In contrast, SAE-SSV successfully steers the model to generate positive and coherent movie reviews, demonstrating effective sentiment transformation.
  • ...and 19 more figures
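
The control comparison in Figure 4 can be approximated with a small sketch. It follows one possible reading of the caption: activations are shifted along a steering direction, an orthogonal direction, or a random direction, and the average projection onto the steering direction is then measured. The activations, the steering direction, the strength `alpha`, and all function names are hypothetical stand-ins; how the paper constructs its orthogonal and random directions is not stated in this section.

```python
import numpy as np

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

def orthogonal_direction(v, rng):
    """Random unit vector with its component along v removed (one Gram-Schmidt step)."""
    u = unit(v)
    r = rng.normal(size=v.shape)
    r = r - (r @ u) * u
    return unit(r)

def mean_projection(acts, direction):
    """Average scalar projection of each row of acts onto direction."""
    return float(np.mean(acts @ unit(direction)))

rng = np.random.default_rng(0)
d = 32
acts = rng.normal(size=(100, d))      # stand-in token activations
steer = unit(rng.normal(size=d))      # stand-in for a learned steering direction
ortho = orthogonal_direction(steer, rng)
rand = unit(rng.normal(size=d))
alpha = 4.0                           # hypothetical steering strength

candidates = [
    ("no steering", np.zeros(d)),
    ("steering", steer),
    ("orthogonal", ortho),
    ("random", rand),
]
for name, delta in candidates:
    shifted = acts + alpha * delta
    print(name, round(mean_projection(shifted, steer), 3))
```

By construction, shifting along the steering direction raises the mean projection by exactly `alpha`, the orthogonal control leaves it unchanged up to floating-point error, and a random direction shifts it by `alpha` times its cosine with the steering direction, which is typically small in high dimensions.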