Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Xu Wang, Yan Hu, Benyou Wang, Difan Zou

TL;DR

The study investigates whether higher SAE interpretability yields better LLM steering utility, revealing a relatively weak association (Kendall’s tau_b ≈ 0.298) and a notable interpretability-utility gap. It introduces Delta Token Confidence to screen SAE features by their impact on the next-token distribution, achieving on average a 52.52% improvement in steering over the best prior selector across three LLMs. Importantly, after applying this feature selection, the interpretability-utility correlation collapses to near zero or becomes negative for high-utility features, highlighting a divergence between what is interpretable and what effectively steers. The findings advocate for utility-oriented SAE training or post-hoc feature selection and motivate developing general-purpose utility indicators to reliably predict steerability across models and architectures.

Abstract

Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective steering of model behavior. Yet a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels. We evaluate their interpretability with SAEBench (arXiv:2501.12345) and their steering utility with AxBench (arXiv:2502.23456), then perform a rank-agreement analysis using Kendall's rank coefficient ($\tau_b$). Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We conjecture that the interpretability-utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To find features that truly steer the behavior of LLMs, we propose a novel selection criterion called $\Delta$ Token Confidence, which measures how much amplifying a feature changes the next-token distribution. We show that our method improves the steering performance of three LLMs by 52.52% compared to the current best output-score-based criterion (arXiv:2503.34567). Strikingly, after selecting features with high $\Delta$ Token Confidence, the correlation between interpretability and utility vanishes ($\tau_b \approx 0$) and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.
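The rank-agreement step mentioned in the abstract can be illustrated with a minimal sketch of Kendall's $\tau_b$ over per-SAE scores. This is not the authors' evaluation code: the function name `kendall_tau_b` and the score lists are hypothetical, chosen only to show how concordant and discordant SAE pairs, with a tie correction, produce the coefficient.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists (tie-corrected)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    # Compare every unordered pair of items (here: every pair of SAEs).
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical scores for five SAEs (illustrative values, not paper results).
interpretability = [0.9, 0.8, 0.7, 0.6, 0.5]
steering_utility = [0.3, 0.5, 0.2, 0.4, 0.1]
tau = kendall_tau_b(interpretability, steering_utility)
print(f"tau_b = {tau:.3f}")
```

A value near 1 would mean the two rankings order the SAEs almost identically, a value near 0 means knowing one ranking says little about the other, and a negative value means the rankings tend to disagree.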

Paper Structure

This paper contains 34 sections, 17 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of our goal: building a bridge between SAE interpretability and utility. Interpretability (left): an SAE attached to the LLM decomposes hidden states into sparse, human-describable features. An LLM judge yields an interpretability score for the SAE [paulo2025automatically]. Utility (right): at inference, we activate a target SAE feature (e.g., 'cake') to steer generation. An LLM judge yields a steering utility score [wu2025axbenchsteeringllmssimple].
  • Figure 2: Overview of our pairwise-controlled workflow linking SAE interpretability with steering utility. (S1) Compute an interpretability score and a steering score for each SAE. (S2) Run a pairwise analysis across SAEs, which yields an insight (the top-right green box) revealing an interpretability–utility gap. The red box (lower right) is our further inference based on the green box above and previous studies [wu2025axbenchsteeringllmssimple]. (S3) Use $\Delta$ Token Confidence to select higher-utility features. (S4) Compute steering gains after selection per SAE, then run the pairwise analysis between steering gains and interpretability. The green box in the middle left is our final conclusion.
  • Figure 3: Distribution of per-feature $\Delta$ Token Confidence across all SAEs. Panels show histograms for Gemma-2-2B, Qwen-2.5-3B, and Gemma-2-9B; the $x$-axis is $\Delta C_k$ (negative values indicate increased confidence, positive values decreased confidence) and the $y$-axis is the number of SAE features. The shaded area marks the high-magnitude tails from which candidate steering features are selected, while the central mass near $0$ indicates features with little distributional impact.
  • Figure 4: Comparison of different SAE steering methods with five SAE architectures across three LLMs. Panels correspond to Gemma-2-2B, Qwen-2.5-3B, and Gemma-2-9B. The horizontal axis groups SAE architectures (BatchTopK, Gated, JumpReLU, ReLU, TopK), and the vertical axis reports the steering score. Bars show three conditions: SAE Base (no feature selection), Output Score Selection, and $\Delta$ Token Confidence Selection (ours). Panel annotations summarize the average lift of each selection method relative to SAE Base.
  • Figure 5: SAEBench results for Gemma-2-2B: interpretability remains strong at lower $L_0$, absorption stays low for compact codes, Core is near ceiling, and structure (SCR/RAVEL) improves with moderate capacity.
  • ...and 5 more figures