A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Jiaqing Xie

TL;DR

This work tackles steering LLMs via internal activations by identifying that many top-$K$ SAE latents are non-semantic; it proposes using the single most relevant latent (top-1) and a token-wise decaying steering schedule with $\boldsymbol{\alpha}_t$ to improve stability and comparability with MeanActDiff. The authors introduce a differential SAE feature extraction pipeline and a decaying steering mechanism, showing that SAE steering can reliably elicit step-by-step reasoning in math tasks and provides finer control in formatting tasks; it matches MeanActDiff on IF-Eval benchmarks and often outperforms MeanActDiff in several reasoning scenarios, though Chain-of-Thought prompting remains the performance ceiling. The method offers interpretability through an SAE feature dictionary and demonstrates that steering behaves similarly to appending a guiding token in some cases, while also highlighting task-dependent strengths and limitations across multilingual instruction-following. Overall, the approach enables a more principled, fair comparison of SAE-based steering to MeanActDiff, with practical implications for controllable generation in reasoning and formatting tasks. The top-1 latent, the schedule $\boldsymbol{\alpha}_t$, and SAE feature dynamics provide interpretable levers for guiding generation while preserving generative flexibility.
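To make the contrast between the two steering signals concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that "most relevant" means the SAE latent whose mean activation rises most between prompts with and without the target instruction; the function names (`mean_vector`, `mean_act_diff`, `top1_latent`) and the toy activations are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a batch of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_act_diff(pos: Sequence[Sequence[float]],
                  neg: Sequence[Sequence[float]]) -> Vector:
    """Mean activation difference: mean(pos) - mean(neg), per dimension."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def top1_latent(pos_latents: Sequence[Sequence[float]],
                neg_latents: Sequence[Sequence[float]]) -> int:
    """Index of the single SAE latent with the largest mean increase.

    Assumption: relevance is scored by the differential activation
    between instructed (pos) and uninstructed (neg) prompts.
    """
    diff = mean_act_diff(pos_latents, neg_latents)
    return max(range(len(diff)), key=lambda i: diff[i])


# Toy SAE latent activations (hypothetical): 2 prompts x 3 latents each.
pos_latents = [[0.0, 2.0, 0.5], [0.0, 3.0, 0.5]]
neg_latents = [[0.0, 0.5, 0.5], [0.0, 0.5, 0.4]]
best = top1_latent(pos_latents, neg_latents)
```

In this toy setup the same differencing step serves both methods: applied to residual-stream activations it gives a MeanActDiff-style direction, and applied to SAE latent activations it ranks latents so that only the top one is kept.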

Abstract

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
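The token-wise decaying strategy can be illustrated with a short sketch under stated assumptions. The abstract does not specify the decay schedule, so the geometric form `alpha0 * decay ** t` below is an assumption, as are the names `decaying_alphas` and `steer` and the toy vectors; `direction` stands in for the decoder direction of the chosen SAE latent.

```python
from typing import List


def decaying_alphas(alpha0: float, decay: float, num_tokens: int) -> List[float]:
    """Steering strength per generated token (assumed geometric decay)."""
    return [alpha0 * decay ** t for t in range(num_tokens)]


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Add the scaled steering direction to one token's hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical example: strong steering on the first token, fading afterwards.
alphas = decaying_alphas(alpha0=8.0, decay=0.5, num_tokens=4)
direction = [0.0, 1.0]   # stand-in for the selected latent's decoder vector
hidden = [1.0, 0.0]      # stand-in for a residual-stream activation
steered = [steer(hidden, direction, a) for a in alphas]
```

Constant steering corresponds to `decay=1.0`, the setting the abstract associates with degenerate, repetitive outputs; any `decay` below 1 lets the model's own dynamics take over as generation proceeds.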

Paper Structure

This paper contains 64 sections, 15 equations, 25 figures, 32 tables, 2 algorithms.

Figures (25)

  • Figure 1: Mean Activation Difference (MeanActDiff) produces incorrect answers in math reasoning cases, while constant SAE steering often leads to word repetitions or wrong outputs. Our proposed decaying strategy applies the SAE steering vector progressively, yielding both step-by-step reasoning and correct answers, and enabling a fair comparison with MeanActDiff.
  • Figure 2: Pipeline of SAE feature extraction and token-wise decaying steering applied to question answering in German.
  • Figure 3: Performance plots for JSON formatting, lowercase, and uppercase.
  • Figure 4: Top-activated SAE features are almost the same in word inclusion cases.
  • Figure 5: Ablation Studies
  • ...and 20 more figures