A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering
Jiaqing Xie
TL;DR
This work studies steering LLMs through their internal activations. It observes that many top-$k$ SAE latents are non-semantic and proposes steering with only the single most relevant latent (top-1), combined with a token-wise decaying steering schedule $\boldsymbol{\alpha}_t$ that improves stability and comparability with MeanActDiff. The authors introduce a differential SAE feature extraction pipeline and a decaying steering mechanism, and show that SAE steering reliably elicits step-by-step reasoning in math tasks and provides finer control in formatting tasks. It matches MeanActDiff on IF-Eval and often outperforms it in reasoning scenarios, though Chain-of-Thought prompting remains the performance ceiling. The method offers interpretability through an SAE feature dictionary and shows that steering behaves similarly to appending a guiding token in some cases, while also highlighting task-dependent strengths and limitations across multilingual instruction-following. Overall, the approach enables a more principled, fair comparison of SAE-based steering with MeanActDiff, with practical implications for controllable generation in reasoning and formatting tasks. The top-1 latent, the schedule $\boldsymbol{\alpha}_t$, and SAE feature dynamics provide interpretable levers for guiding generation while preserving generative flexibility.
Abstract
Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
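
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the top-1 latent is chosen as the one with the largest mean activation difference between two prompt sets, and that the decaying schedule is exponential; the abstract specifies neither, so the function names (`top1_latent`, `decaying_alpha`, `steer`), the parameters `alpha0` and `decay`, and the toy arrays are all hypothetical.

```python
import numpy as np


def top1_latent(pos_latents, neg_latents):
    """Return the index of the latent whose mean activation differs most
    between a 'positive' and a 'negative' prompt set (assumed criterion).

    Both inputs have shape (num_prompts, num_latents).
    """
    diff = pos_latents.mean(axis=0) - neg_latents.mean(axis=0)
    return int(np.argmax(diff))


def decaying_alpha(alpha0, decay, num_tokens):
    """Token-wise steering strengths alpha_t = alpha0 * decay**t.

    The exponential form is an assumption made for illustration only.
    """
    return alpha0 * decay ** np.arange(num_tokens)


def steer(hidden, direction, alphas):
    """Add alphas[t] * direction to the hidden state at each token t.

    hidden: (num_tokens, d_model), direction: (d_model,), alphas: (num_tokens,)
    """
    return hidden + alphas[:, None] * direction[None, :]


# Toy data: latent 1 is much more active on the positive prompts.
pos = np.array([[0.0, 2.0, 0.1], [0.0, 3.0, 0.2]])
neg = np.array([[0.0, 0.5, 0.1], [0.0, 0.5, 0.3]])
best = top1_latent(pos, neg)

# Hypothetical decoder direction for the chosen latent in a 2-d residual stream.
direction = np.array([1.0, 0.0])
alphas = decaying_alpha(alpha0=4.0, decay=0.5, num_tokens=3)
steered = steer(np.zeros((3, 2)), direction, alphas)
```

With `decay=1.0` the schedule reduces to constant steering, which is the setting the abstract reports as prone to degenerate, repetitive outputs; values below 1 shrink the intervention as generation proceeds.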
