AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

TL;DR

This paper targets interpretability of large language models through sparse autoencoders (SAEs) and presents a unified proximal-gradient framework that links SAE encoders to proximal operators. It identifies a core limitation, namely that non-negativity constraints fragment bidirectional semantic axes, and introduces AbsTopK, an $\ell_0$-based SAE that preserves both positive and negative activations to capture bipolar concepts within a single feature. The authors provide theoretical grounding by mapping activation functions (ReLU, JumpReLU, TopK) to proximal maps, and demonstrate, via experiments across four LLMs and seven probing/steering tasks, that AbsTopK yields higher reconstruction fidelity and richer interpretability, often matching or surpassing the supervised Difference-in-Mean method. AbsTopK's bidirectional representations enable more robust interventions with better tradeoffs between steering and general capabilities, suggesting a practical path toward principled mechanistic interpretability in large-scale models. The work also points to future directions for efficient neural approximations of the $\ell_0$ operator to scale the approach to even larger models.

Abstract

Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.
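
To make the contrast concrete, the following is a minimal NumPy sketch of the two selection rules described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names `topk` and `abstopk` are hypothetical, and the non-negative TopK shown here (keep the $k$ largest signed entries, clamp negatives to zero) is one common variant that may differ from the paper's exact definition.

```python
import numpy as np

def topk(u: np.ndarray, k: int) -> np.ndarray:
    """Assumed non-negative TopK: keep the k largest signed entries, zero the rest."""
    out = np.zeros_like(u)
    idx = np.argsort(u)[-k:]            # indices of the k largest signed values
    out[idx] = np.maximum(u[idx], 0.0)  # negatives can never survive
    return out

def abstopk(u: np.ndarray, k: int) -> np.ndarray:
    """AbsTopK as described in the abstract: keep the k largest-magnitude entries, with sign."""
    out = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]    # indices of the k largest |u_i|
    out[idx] = u[idx]                   # hard thresholding, sign preserved
    return out

# Hypothetical pre-activations for a single token.
u = np.array([0.9, -1.4, 0.1, -0.2, 0.5])
print(topk(u, 2))     # keeps 0.9 and 0.5
print(abstopk(u, 2))  # keeps 0.9 and -1.4
```

In this toy input the strongest response is the negative entry at index 1. The non-negative rule discards it, so a separate feature would be needed to represent that direction, whereas the magnitude-based rule keeps it within the same feature.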

Paper Structure

This paper contains 37 sections, 1 theorem, 29 equations, 4 figures, 2 tables.

Key Result

Lemma 1

Denote by $\operatorname{ReLU}_\lambda$, $\operatorname{JumpReLU}_\theta$, and $\operatorname{TopK}_k$ the following operators, where $\mathcal{T}_k(\boldsymbol{u})$ denotes the set of indices corresponding to the $k$ largest entries. In case the $k$ largest components are not uniquely defined, one can choose among them (for example, by selecting the components with the smallest indices) to ensure exactly $k$
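
For orientation, the forms of these operators most commonly used in the SAE literature can be written as below. This is an assumption rather than a quotation of the lemma: the paper's exact statement may differ, for example in how thresholds, ties, or negative entries are handled.

```latex
\begin{aligned}
\operatorname{ReLU}_\lambda(\boldsymbol{u})_i &= \max(u_i - \lambda,\, 0), \\
\operatorname{JumpReLU}_\theta(\boldsymbol{u})_i &= u_i \,\mathbb{1}[u_i > \theta], \\
\operatorname{TopK}_k(\boldsymbol{u})_i &=
  \begin{cases}
    u_i, & i \in \mathcal{T}_k(\boldsymbol{u}), \\
    0,   & \text{otherwise}.
  \end{cases}
\end{aligned}
```

Under the same assumptions, AbsTopK as described in the abstract would replace $\mathcal{T}_k(\boldsymbol{u})$ with the index set of the $k$ entries of largest magnitude $|u_i|$, keeping each selected $u_i$ with its sign.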

Figures (4)

  • Figure 1: AbsTopK enables single latent features to encode opposing concepts by leveraging both positive and negative activations. To test this, we generated controlled sentence pairs with only one differing token (man vs. woman). The shown feature activates positively for man and negatively for woman, demonstrating bidirectional encoding. Unlike conventional SAEs, which are restricted by a non-negativity constraint, AbsTopK more compactly captures opposing semantics within a single dimension, yielding richer and more coherent representations.
  • Figure 2: Performance comparison of JumpReLU, TopK, and AbsTopK SAEs on Qwen3-4B Layer 20, showing (a) MSE Training Loss, (b) Normalized MSE, and (c) Loss Recovered. Additional results across models and layers are provided in the appendix.
  • Figure 3: Performance comparison of SAE variants (TopK, AbsTopK, and JumpReLU) across tasks on Qwen3-4B Layer 18. For all tasks, higher scores indicate better performance; the Unlearning and Absorption scores have been transformed as $1 - \text{original score}$ to maintain this consistency. For more details, see the appendix on SAE task details.
  • Figure 4: Performance comparison of JumpReLU, TopK, and AbsTopK SAEs on all other models and layers, showing (a) MSE Training Loss, (b) Normalized MSE, and (c) Loss Recovered.
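
The reconstruction metrics named in Figures 2 and 4 are not defined in the captions. The sketch below shows conventions that are widely used when evaluating SAEs; treating them as the paper's definitions is an assumption, and the function and argument names (`normalized_mse`, `loss_recovered`, `loss_clean`, `loss_spliced`, `loss_ablated`) are hypothetical.

```python
import numpy as np

def normalized_mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Squared reconstruction error relative to always predicting the batch mean (assumed convention)."""
    err = np.sum((x - x_hat) ** 2)
    baseline = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return float(err / baseline)

def loss_recovered(loss_clean: float, loss_spliced: float, loss_ablated: float) -> float:
    """Fraction of language-model loss recovered when the SAE reconstruction is
    spliced in, relative to ablating the activations entirely (assumed convention)."""
    return (loss_ablated - loss_spliced) / (loss_ablated - loss_clean)

# Toy activations: two samples, two dimensions.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.0], [3.0, 3.0]])
print(normalized_mse(x, x_hat))        # 0.25
print(loss_recovered(2.0, 2.1, 5.0))   # about 0.967
```

With these conventions, a lower normalized MSE and a loss-recovered value closer to 1 both indicate a more faithful reconstruction.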

Theorems & Definitions (2)

  • Lemma 1
  • proof