Binary Sparse Coding for Interpretability

Lucia Quirke, Stepan Shabalin, Nora Belrose

TL;DR

This work investigates the interpretability of neural activations by binarising sparse autoencoder features, introducing binary sparse autoencoders (BAEs) and binary transcoders (BTCs) that constrain activations to $\{0,1\}$. While binarisation improves unweighted interpretability and the monosemanticity of features, it increases reconstruction error and yields a surge of ultra-high-frequency, uninterpretable features, suggesting polysemanticity may be ineliminable. Training employs differentiable relaxations (sigmoid STE, Gumbel-Softmax, GroupMax), and evaluation uses LLM-based sparse probing and ablation analyses across SmolLM2 variants. Overall, the paper presents a negative result for fully removing polysemanticity with binarisation, but highlights insights for future hybrid approaches and improved interpretability evaluation. The findings imply caution in deploying binary sparse representations and motivate exploring mixed architectures and refined evaluation protocols.

Abstract

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue we propose to use binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.
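
To make the contrast between continuous and binary activations concrete, here is a minimal forward-pass sketch. It assumes hard top-$k$ selection for both coders; the function names, the toy decoder matrix, and the pre-activation values are all hypothetical and are not taken from the paper. The training-time relaxations (sigmoid STE, Gumbel-Softmax, GroupMax) are not implemented here.

```python
from typing import List


def topk_indices(pre_acts: List[float], k: int) -> List[int]:
    """Return the indices of the k largest pre-activations (ties go to the lower index)."""
    order = sorted(range(len(pre_acts)), key=lambda i: (-pre_acts[i], i))
    return sorted(order[:k])


def continuous_topk(pre_acts: List[float], k: int) -> List[float]:
    """Continuous coder (assumed form): keep the top-k values, zero the rest."""
    keep = set(topk_indices(pre_acts, k))
    return [a if i in keep else 0.0 for i, a in enumerate(pre_acts)]


def binary_topk(pre_acts: List[float], k: int) -> List[float]:
    """Binary coder (assumed form): the top-k features fire with activation exactly 1."""
    keep = set(topk_indices(pre_acts, k))
    return [1.0 if i in keep else 0.0 for i in range(len(pre_acts))]


def decode(acts: List[float], decoder: List[List[float]], bias: List[float]) -> List[float]:
    """Reconstruction = bias + sum over features of activation * decoder row."""
    out = list(bias)
    for a, row in zip(acts, decoder):
        if a != 0.0:
            for j, w in enumerate(row):
                out[j] += a * w
    return out


# Hypothetical toy values: 4 latent features, 2-dimensional output.
pre_acts = [0.1, 2.5, -0.3, 0.7]
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
bias = [0.0, 0.0]

cont = continuous_topk(pre_acts, k=2)
binr = binary_topk(pre_acts, k=2)
recon_cont = decode(cont, decoder, bias)
recon_bin = decode(binr, decoder, bias)
```

Both coders select the same two features here, but the binary code discards the magnitudes 2.5 and 0.7. That is the sense in which no information can be carried by activation strength, and it is also one plausible reason reconstruction error goes up.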

Paper Structure

This paper contains 18 sections, 5 equations, 22 figures, 2 tables.

Figures (22)

  • Figure 1: Binary vs. continuous next token prediction cross-entropy loss increase over various values of $k$. Each model is trained on 20 billion tokens.
  • Figure 2: Unweighted fuzzing interpretability scores for binary vs. continuous skip-transcoders trained on SmolLM2-135M, with various values of $k$. By this metric, binary coders match or outperform continuous ones across all layers and values of $k$.
  • Figure 3: Frequency-weighted fuzzing interpretability scores for binary vs. continuous skip-transcoders trained on SmolLM2-135M, with various values of $k$. By this metric, binary coders are often less interpretable than continuous ones.
  • Figure 4: Activating examples from different deciles of a TopK skip-transcoder feature. The top activations appear in a coding context, while the lower activations appear in diverse contexts.
  • Figure 5: Activating examples for a binary skip-transcoder feature that achieved perfect auto-interpretability scores. The generated explanation is "The verb 'search' or its variants, often used in contexts of looking for something, someone, or information, and frequently associated with a sense of investigation, inquiry, or pursuit." Note that a counterexample to this explanation, a non-activating instance of "Search", appears mid-way through Example 111.
  • ...and 17 more figures
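
The captions for Figures 2 and 3 report opposite rankings under unweighted and frequency-weighted scoring. The sketch below illustrates how that can happen. It assumes that frequency weighting means averaging per-feature scores with weights proportional to each feature's firing frequency; the exact definition is not given in this excerpt. All scores and frequencies are made-up illustrative numbers, not results from the paper.

```python
from typing import List


def unweighted_mean(scores: List[float]) -> float:
    """Plain average: every feature counts equally."""
    return sum(scores) / len(scores)


def frequency_weighted_mean(scores: List[float], freqs: List[float]) -> float:
    """Average weighted by firing frequency (assumed definition of 'frequency-weighted')."""
    total = sum(freqs)
    return sum(s * f for s, f in zip(scores, freqs)) / total


# Hypothetical binary coder: three rare, highly interpretable features
# plus one ultra-high-frequency feature with a chance-level score.
binary_scores = [0.95, 0.95, 0.95, 0.50]
binary_freqs = [0.01, 0.01, 0.01, 0.50]

# Hypothetical continuous coder: uniformly decent features, no dense feature.
cont_scores = [0.80, 0.80, 0.80, 0.80]
cont_freqs = [0.02, 0.02, 0.02, 0.02]

binary_unweighted = unweighted_mean(binary_scores)
cont_unweighted = unweighted_mean(cont_scores)
binary_weighted = frequency_weighted_mean(binary_scores, binary_freqs)
cont_weighted = frequency_weighted_mean(cont_scores, cont_freqs)
```

With these illustrative inputs the binary coder wins on the unweighted mean but loses on the weighted one, because the single dense feature dominates the weighted average.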