AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts

TL;DR

AxBench introduces a large-scale benchmark for steering LLMs and detecting concepts, enabling direct comparisons across prompting, finetuning, and representation-based methods. Across open-vocabulary concepts and long-form generation, prompting and standard finetuning outperform representation-based approaches such as sparse autoencoders (SAEs); however, a novel weakly-supervised method, ReFT-r1, approaches the efficacy of these baselines while offering interpretability advantages. The study also releases SAE-scale dictionaries and Concept16K datasets to foster further research. Overall, AxBench shows that representation-based steering still lags behind traditional control methods, but joint learning approaches such as ReFT-r1 show promise for closing the gap. The work emphasizes the need for comprehensive benchmarks to drive progress in LM control techniques.

Abstract

Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
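
To make the difference-in-means idea mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is not the AxBench implementation: the synthetic "activations", the dimension `DIM`, the shift of 3.0, the steering factor of 4.0, and the helper names `diffmean_direction`, `detect`, and `steer` are all assumptions chosen for illustration. The sketch only shows the general technique: take the difference between the mean activation on concept-positive examples and the mean on negative examples, score new activations by projecting onto that direction, and steer by adding a scaled copy of it.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diffmean_direction(pos, neg):
    """Unit-norm difference between mean positive and mean negative activations."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def detect(act, direction):
    """Concept score: projection of one activation onto the direction."""
    return sum(a * w for a, w in zip(act, direction))


def steer(act, direction, factor):
    """Steered activation: the original plus a scaled copy of the direction."""
    return [a + factor * w for a, w in zip(act, direction)]


# Synthetic stand-ins for hidden states (hypothetical data, not model activations):
# positive examples are shifted along coordinate 0.
random.seed(0)
DIM = 8


def sample(shift):
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += shift
    return vec


neg = [sample(0.0) for _ in range(200)]
pos = [sample(3.0) for _ in range(200)]

direction = diffmean_direction(pos, neg)
pos_score = sum(detect(a, direction) for a in pos) / len(pos)
neg_score = sum(detect(a, direction) for a in neg) / len(neg)
steered = steer(neg[0], direction, 4.0)
```

Because the direction has unit norm, steering with factor 4.0 raises the detection score of an activation by exactly 4.0, which is one way to see how a single vector can serve both the detection and the steering evaluation.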

Paper Structure

This paper contains 74 sections, 18 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Average results across eight tasks on concept detection (0–2) vs. model steering (0–2) for all methods on AxBench. *Only evaluated on Gemma-2-2B.
  • Figure 2: Key components of AxBench: (a) an example of how we collect data for evaluating concept detection and model steering; (b) the synthetic data generation process for training and evaluation given Golden Gate Bridge as a concept; and (c) the contrasting training pipelines of SAEs and SDLs; both use LLMs, but SAEs use them to label pretrained features while we instead direct them to generate training data.
  • Figure 3: Mean F1 scores vs. dataset balance.
  • Figure 4: Mean concept score vs. instruct score as the steering factor for each method is varied.
  • Figure 5: Mean ROC curves over all concepts.
  • ...and 10 more figures