Tokenized SAEs: Disentangling SAE Reconstructions

Thomas Dooms, Daniel Wilhelm

TL;DR

This work shows that sparse auto-encoders trained for language often learn features tied to local token statistics, a consequence of training data imbalance. It introduces Tokenized SAEs, adding a per-token bias via a lookup table to disentangle token reconstruction from context reconstruction, improving reconstruction quality while producing sparser, more semantically meaningful features. Across GPT-2 small and preliminary Pythia-1.4B experiments, TSAEs yield better Pareto-frontier performance, faster training, and robustness to deeper models, suggesting a practical path to more interpretable and efficient mechanistic analyses. The approach highlights the importance of accounting for data distribution biases when interpreting learned representations and opens avenues for incorporating multi-token statistics in interpretability pipelines.

Abstract

Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. It does so by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, the SAE learns significantly more interesting features and achieves improved reconstruction in sparse regimes.
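
To make the per-token bias concrete, here is a minimal NumPy sketch of a forward pass in which a lookup table indexed by token id is added to the decoder output. It is an illustration under stated assumptions, not the authors' implementation: the sizes, the random initialization, the top-k sparsity rule, the choice to add the lookup only on the decoder side, and every name (`token_bias`, `tokenized_sae_forward`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model, d_sae, k = 50, 16, 64, 4

# Ordinary SAE parameters (random here; a real SAE would be trained).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

# Per-token bias: one d_model-sized vector per vocabulary entry.
token_bias = rng.normal(scale=0.1, size=(vocab_size, d_model))


def tokenized_sae_forward(x, token_ids):
    """x: (n, d_model) activations; token_ids: (n,) integer token ids."""
    pre = x @ W_enc + b_enc

    # Assumption: sparsity is enforced by keeping the k largest
    # pre-activations per row and passing them through a ReLU.
    top = np.argpartition(pre, -k, axis=1)[:, -k:]
    mask = np.zeros_like(pre, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    acts = np.where(mask, np.maximum(pre, 0.0), 0.0)

    # Assumption: the lookup is added on the decoder side only, so the
    # sparse features only need to explain what the token bias does not.
    recon = acts @ W_dec + token_bias[token_ids]
    return recon, acts


x = rng.normal(size=(8, d_model))
token_ids = rng.integers(0, vocab_size, size=8)
recon, acts = tokenized_sae_forward(x, token_ids)
```

In this sketch the reconstruction splits into two additive parts: `token_bias[token_ids]`, which depends only on the current token, and `acts @ W_dec`, which carries whatever the sparse features contribute on top of it.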

Paper Structure

This paper contains 31 sections, 4 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: In the OpenWebText corpus, particular $n$-grams are seen exponentially more often than others. Many combinations occur millions of times more than an arbitrary $n$-gram.
  • Figure 2: With increasing OpenWebText token frequency, the reconstruction MSE of unigrams in layers 5, 8, and 11 of the RES-JB SAE decreases. This indicates the SAE effectively memorizes the most common tokens. This effect is not as pronounced with bigrams, likely because they are composed of common unigrams and/or occupy unigram subspaces.
  • Figure 3: To memorize unigrams exactly and sparsely, the SAE represents each using a small subset of feature neurons that fire in response to the unigram. Due to the incorporation of prior token information, SAEs in later layers often also strongly memorize bigrams.
  • Figure 4: An illustration of the experimental results: an individual feature neuron activates when one of its associated $n$-grams is present. The most common tokens occupy a full feature, while less common tokens share a feature. To maximize reconstruction, this sharing occurs between semantically similar tokens.
  • Figure 5: Cosine similarity between hidden representations and a patched version that includes only the last $n$ tokens in GPT-2 small. Trigrams are generally an adequate approximation across the network.
  • ...and 13 more figures
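
The frequency imbalance described in the Figure 1 caption can be measured by counting $n$-grams over a token stream. The sketch below is only an illustration: the toy sentence and the whitespace split are stand-ins for OpenWebText and a real tokenizer, and `ngram_counts` is a hypothetical helper name.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus; a real measurement would stream tokenizer output instead.
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# Ranking by count exposes the skew: a few n-grams dominate the total.
most_common_unigram, unigram_freq = unigrams.most_common(1)[0]
most_common_bigram, bigram_freq = bigrams.most_common(1)[0]
```

On a large corpus, plotting these counts against rank on a log scale would show how strongly the distribution is skewed.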