Binary Sparse Coding for Interpretability
Lucia Quirke, Stepan Shabalin, Nora Belrose
TL;DR
This work investigates the interpretability of neural activations by binarising sparse autoencoder features, introducing Binary Sparse Autoencoders (BAEs) and Binary Transcoders (BTCs) that constrain activations to $\{0,1\}$. While binarisation improves unweighted interpretability and the monosemanticity of features, it increases reconstruction error and yields a surge of ultra-high-frequency, uninterpretable features, suggesting polysemanticity may be ineliminable. Training employs differentiable relaxations (sigmoid straight-through estimator (STE), Gumbel-Softmax, GroupMax), and the resulting coders are evaluated via LLM-based sparse probing and ablation analyses across SmolLM2 variants. Overall, the paper presents a negative result for fully removing polysemanticity with binarisation, but highlights insights for future hybrid approaches and improved interpretability evaluation. The findings imply caution in deploying binary sparse representations and motivate exploring mixed architectures and refined evaluation protocols.
Abstract
Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue, we propose binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high-frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.
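To make the binarisation idea concrete, the following is a minimal sketch of one binary coding step with a sigmoid straight-through estimator. It is an illustration under stated assumptions, not the paper's implementation: the function names (`binarise`, `ste_grad`, `decode`), the 0.5 threshold, and the toy decoder are all hypothetical, and the encoder and sparsity mechanism are omitted. The forward pass emits hard 0/1 codes, while the backward pass uses the sigmoid's derivative as a surrogate gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarise(pre_acts, threshold=0.5):
    """Forward pass: map encoder pre-activations to hard 0/1 codes.

    The threshold of 0.5 is an assumption made for this sketch.
    Returns the binary codes and the sigmoid probabilities.
    """
    probs = [sigmoid(p) for p in pre_acts]
    codes = [1.0 if p > threshold else 0.0 for p in probs]
    return codes, probs


def ste_grad(probs, upstream):
    """Backward pass under a sigmoid straight-through estimator.

    The hard threshold has zero gradient almost everywhere, so the
    sigmoid's derivative p * (1 - p) is used as a surrogate.
    """
    return [g * p * (1.0 - p) for g, p in zip(upstream, probs)]


def decode(codes, decoder, bias):
    """Reconstruct by summing the decoder directions of active features.

    Because codes are 0 or 1, a feature contributes its full direction
    or nothing; there is no activation strength to scale it.
    """
    out = list(bias)
    for code, direction in zip(codes, decoder):
        if code:
            out = [o + d for o, d in zip(out, direction)]
    return out


# Toy example: three hypothetical features in a 2-dimensional space.
pre_acts = [2.0, -1.0, 0.0]
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.5, 0.5]

codes, probs = binarise(pre_acts)
reconstruction = decode(codes, decoder, bias)
grads = ste_grad(probs, [1.0, 1.0, 1.0])
```

In this toy case only the first feature fires, so the reconstruction is the bias plus that feature's decoder direction. Since every active feature contributes with unit weight, no information can be carried by the magnitude of an activation, which is the property the abstract attributes to binarisation.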
