SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du
TL;DR
This work tackles the challenge of reliably steering open-ended generation in large language models by constraining interventions to a sparse, task-relevant latent subspace learned with sparse autoencoders. It introduces SAE-SSV, which uses dimension probing to identify a compact subspace and learns a supervised steering vector within it, balancing behavioral alignment with generation quality through a composite loss. Empirical results across sentiment, truthfulness, and political polarity steering show that SAE-SSV achieves higher steering success than baselines with minimal degradation in generation quality, and analysis reveals that a small subspace suffices for effective, interpretable control. The approach, motivated by representation disentanglement and supervision, offers targeted, efficient interventions, with potential for broader applicability and for future universal steering vectors across tasks and models.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. We then train linear classifiers to identify a small subspace of task-relevant dimensions in these latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates than existing methods with minimal degradation in generation quality. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions. Our implementation is publicly available at https://github.com/Ineedanamehere/SAE-SSV.
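The three stages described in the abstract (SAE encoding, linear probing to select a subspace, and learning a steering vector restricted to that subspace) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and is not taken from the released implementation: the SAE encoder weights are random rather than trained, the data are synthetic, the helper names (`sae_encode`, `train_probe`, `top_k_mask`, `learn_steering_vector`) are hypothetical, and the objective (cross-entropy toward the target label plus an L2 penalty on the vector) is a simple stand-in for the paper's composite loss. How the learned vector is applied back to model activations is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Clip logits to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: dense activations -> non-negative sparse-style latents."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def train_probe(z, y, steps=500, lr=0.1):
    """Logistic-regression probe fitted with plain gradient descent."""
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(steps):
        g = sigmoid(z @ w + b) - y          # dBCE/dlogit per example
        w -= lr * (z.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def top_k_mask(w, k):
    """Binary mask over the k latent dimensions with largest |probe weight|."""
    mask = np.zeros_like(w)
    mask[np.argsort(-np.abs(w))[:k]] = 1.0
    return mask

def learn_steering_vector(z_src, w, b, mask, steps=300, lr=0.5, l2=0.01):
    """Learn v, supported only on `mask`, so the probe scores z_src + v as target (label 1)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        p = sigmoid((z_src + v) @ w + b)
        grad = (p - 1.0).mean() * w + 2.0 * l2 * v   # BCE-to-target + L2 penalty
        v -= lr * grad * mask                        # update only inside the subspace
    return v

# Synthetic "activations": two classes separated along one random direction.
d_model, d_sae, n, k = 16, 64, 100, 8
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
h_src = rng.normal(size=(n, d_model))                    # source behavior (label 0)
h_tgt = rng.normal(size=(n, d_model)) + 1.5 * direction  # target behavior (label 1)

# Stand-in SAE encoder with random (untrained) weights.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
z_src, z_tgt = sae_encode(h_src, W_enc, b_enc), sae_encode(h_tgt, W_enc, b_enc)

# Stage 2: probe the latents and keep a small task-relevant subspace.
w, b = train_probe(np.vstack([z_src, z_tgt]), np.r_[np.zeros(n), np.ones(n)])
mask = top_k_mask(w, k)

# Stage 3: learn a steering vector constrained to that subspace.
v = learn_steering_vector(z_src, w, b, mask)

base = sigmoid(z_src @ w + b).mean()
steered = sigmoid((z_src + v) @ w + b).mean()
print(f"nonzero dims in v: {np.count_nonzero(v)} of {d_sae}")
print(f"mean probe score: base={base:.3f}, steered={steered:.3f}")
```

In this sketch the mask is what enforces the subspace constraint: the vector starts at zero and every update is multiplied by the mask, so dimensions outside the selected set are never modified. The L2 term is only a rough proxy for limiting how far the latents are pushed; it does not measure generation quality.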
