AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
TL;DR
AxBench introduces a large-scale benchmark for steering LLMs and detecting concepts, enabling direct comparisons across prompting, finetuning, and representation-based methods. Across open-vocabulary concepts and long-form generation, prompting and standard finetuning outperform representation-based approaches such as sparse autoencoders (SAEs); however, a novel weakly-supervised method, ReFT-r1, approaches the efficacy of these baselines while offering interpretability advantages. The study also releases SAE-scale dictionaries and Concept16K datasets to foster further research. Overall, AxBench shows that representation-based steering still lags behind traditional control methods, but joint learning approaches such as ReFT-r1 show meaningful promise for closing the gap. The work emphasizes the need for comprehensive benchmarks to drive progress in LM control techniques.
Abstract
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
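
To make the difference-in-means idea mentioned in the abstract concrete, here is a minimal sketch on synthetic vectors. It is not the AxBench implementation: the function names (`diffmean_direction`, `detect`, `steer`), the toy data, and the choice to normalize the direction to unit length are all assumptions made for illustration. The sketch treats each row as one hidden state, takes the difference between the mean of concept-positive and concept-negative states as a single direction, scores new states by projecting onto it (detection), and adds a scaled copy of it to the states (additive steering).

```python
import numpy as np

def diffmean_direction(pos, neg):
    """Unit-norm difference between the mean positive and mean negative hidden state."""
    w = pos.mean(axis=0) - neg.mean(axis=0)
    return w / np.linalg.norm(w)

def detect(hidden, w):
    """Concept score per row: projection of each hidden state onto the direction."""
    return hidden @ w

def steer(hidden, w, alpha):
    """Additive steering: shift every hidden state by alpha along the direction."""
    return hidden + alpha * w

# Synthetic stand-in for model activations (assumption: 16-dim hidden states).
rng = np.random.default_rng(0)
d = 16
concept = np.zeros(d)
concept[0] = 1.0                                   # hypothetical "true" concept axis
neg = rng.normal(size=(200, d))                    # states without the concept
pos = rng.normal(size=(200, d)) + 3.0 * concept    # states with the concept

w = diffmean_direction(pos, neg)
pos_scores = detect(pos, w)
neg_scores = detect(neg, w)
steered_scores = detect(steer(neg, w, alpha=2.0), w)
```

Because the direction has unit norm in this sketch, steering with strength `alpha` raises every detection score by exactly `alpha`, which makes the link between the detection and steering uses of the same vector easy to see.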
