Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu
TL;DR
This work tackles the problem of steering LLMs without retraining in data-scarce settings, where steering vectors learned from limited data suffer from noise and bias. It introduces SAE-RSV, a framework that denoises steering vectors by filtering semantically irrelevant features and augments them with semantically related missing features using Sparse Autoencoders and LLM explanations. Across five concepts on Llama-3-8B-Instruct with only 50 training pairs, SAE-RSV consistently outperforms baselines including CAA and LoRA-SFT, and analyses show effective feature counts around 15–20 with robustness to data size and hyperparameters. The approach offers a practical, interpretable path to reliable low-resource steering, with potential broad impact on controllability and safety of LLMs in real-world applications.
Abstract
Steering has emerged as a promising approach for controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. Steering vectors extracted from small datasets often contain task-irrelevant noisy features, which degrade their effectiveness. To refine steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV), which leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich the vector with task-relevant features missing from the small dataset, selected by their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all baseline methods, including supervised fine-tuning. Our findings show that effective steering vectors can be constructed from limited training data by refining the original steering vector through SAEs.
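The two refinement steps described in the abstract (denoise, then augment) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the steering vector is already expressed as per-feature weights in SAE feature space, that relevance judgments arrive as a boolean mask (in the paper these come from SAE feature semantics), and that semantic similarity is cosine similarity between hypothetical explanation embeddings. The function name `refine_steering_vector`, the `sim_threshold` parameter, and the rule for weighting added features are all illustrative assumptions.

```python
import numpy as np

def refine_steering_vector(feature_weights, decoder, relevant_mask,
                           explanation_emb, sim_threshold=0.5):
    """Toy denoise-and-augment refinement in SAE feature space.

    feature_weights : (n_features,) weights of the raw steering vector
                      over SAE features (zero = feature not present).
    decoder         : (n_features, d_model) SAE decoder directions.
    relevant_mask   : (n_features,) True where a feature is judged
                      task-relevant (assumed to be given).
    explanation_emb : (n_features, d_emb) embeddings of feature
                      explanations (hypothetical input).
    """
    feature_weights = np.asarray(feature_weights, dtype=float)
    active = feature_weights != 0
    kept = active & np.asarray(relevant_mask, dtype=bool)

    # Step 1 (denoise): zero out active features judged irrelevant.
    refined = np.where(kept, feature_weights, 0.0)

    # Step 2 (augment): add features absent from the raw vector whose
    # explanations are similar to those of the kept features.
    if kept.any():
        emb = np.asarray(explanation_emb, dtype=float)
        unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
        best_sim = (unit @ unit[kept].T).max(axis=1)
        added = ~active & (best_sim >= sim_threshold)
        # Assumed weighting rule: mean kept weight scaled by similarity.
        refined[added] = feature_weights[kept].mean() * best_sim[added]

    # Map the refined feature weights back to the model's hidden space.
    steering_vector = refined @ np.asarray(decoder, dtype=float)
    return steering_vector, refined
```

In this sketch, features removed as noise in step 1 are never re-added in step 2, since augmentation only considers features that were absent from the raw vector. How the paper actually weights and selects augmented features may differ.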
