Transcoders Beat Sparse Autoencoders for Interpretability

Gonçalo Paulo, Stepan Shabalin, Nora Belrose

TL;DR

The paper addresses interpretability of internal model components by comparing transcoders to sparse autoencoders and introducing skip transcoders. It demonstrates that skip transcoders achieve higher interpretability with lower reconstruction loss across multiple model families, enabling more effective circuit analysis. Evaluations using automated interpretability metrics and SAEBench reveal Pareto-dominance of skip transcoders over SAEs, supporting a shift in focus toward transcoders for mechanistic insight. Practical guidance and open-source code/resources are provided to facilitate adoption and further study.

Abstract

Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher-dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
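
To make the difference between the three architectures in the abstract concrete, here is a minimal NumPy sketch of their forward passes. It is an illustration under stated assumptions, not the authors' implementation: the top-k activation, the random weights, the toy dimensions, and the helper names `topk_relu` and `sparse_coder` are all hypothetical, and the `tanh` layer merely stands in for a real network component.

```python
import numpy as np

def topk_relu(pre, k):
    """Apply ReLU, then keep the k largest values per row and zero the rest.
    (Assumed sparsity mechanism; the abstract does not specify one.)"""
    pre = np.maximum(pre, 0.0)
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def sparse_coder(inp, W_enc, b_enc, W_dec, b_dec, k, W_skip=None):
    """Encode `inp` into sparse latents and decode them.
    With W_skip set, a linear map of the input is added to the output;
    together with b_dec this forms the affine skip connection."""
    latents = topk_relu(inp @ W_enc + b_enc, k)
    out = latents @ W_dec + b_dec
    if W_skip is not None:
        out = out + inp @ W_skip
    return out, latents

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
d_model, n_latents, k, n_tokens = 16, 64, 4, 8

x = rng.normal(size=(n_tokens, d_model))               # input to the component
y = np.tanh(x @ rng.normal(size=(d_model, d_model)))   # stand-in for its output

W_enc = 0.1 * rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = 0.1 * rng.normal(size=(n_latents, d_model))
b_dec = np.zeros(d_model)
W_skip = np.zeros((d_model, d_model))                  # zero-initialised here for illustration

# SAE: input and reconstruction target are the same activations.
sae_out, sae_latents = sparse_coder(y, W_enc, b_enc, W_dec, b_dec, k)
sae_loss = mse(sae_out, y)

# Transcoder: reads the component's input, is trained to predict its output.
tc_out, tc_latents = sparse_coder(x, W_enc, b_enc, W_dec, b_dec, k)
tc_loss = mse(tc_out, y)

# Skip transcoder: same as the transcoder plus the skip path from x.
skip_out, skip_latents = sparse_coder(x, W_enc, b_enc, W_dec, b_dec, k, W_skip=W_skip)
skip_loss = mse(skip_out, y)
```

The only difference between the SAE and the transcoder in this sketch is which tensor is fed in and which is used as the target; the skip transcoder additionally lets a dense linear path carry part of the input-output map, so the sparse latents do not have to.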

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Skip transcoders are a Pareto improvement on interpretability vs. performance degradation. We compare the increase in cross-entropy loss when SAEs and transcoders of 3 different sizes, 32768 latents (top right), 65536 (bottom left) and 131072 (bottom right), are patched into the model. For all sizes, skip transcoders are better than both transcoders and sparse autoencoders, with a lower increase in model loss and a higher average auto-interpretability score. Each panel shows 3 models trained with a different number of active latents (32, 64 and 128), except for the 65536-latent panel, which only has 32 and 64. The auto-interpretability score is defined as the average of the fuzzing and detection scores over approximately 500 latents.
  • Figure 2: Interpretability of latents and generalization of explanations. Both detection and fuzzing scores are higher for skip transcoders and transcoders than for SAEs, and the score distribution is wider for SAEs. Dots in the left plot indicate the average score. We also show the accuracy of the explanations on examples sampled from different quantiles of the activation distribution: accuracy remains higher even for lower quantiles, where the activations are smaller, suggesting that transcoder and skip-transcoder latents represent more monosemantic concepts along the full distribution.
  • Figure 3: Comparison of feature density [bricken2023towards] and the consistent activation heuristic (sum of activations over all tokens divided by the number of tokens). These plots show that STs and SSTs are similar in terms of feature density and have fewer high-density features and more low-density features. This is not a problem because there exist methods for getting rid of low-density features [bricken2023towards; jermyn24ghostgrads; gao2024scaling], but not for regularizing high-density features.
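
The two per-latent statistics compared in Figure 3 can be sketched as follows. The second function follows the caption's definition directly; the first assumes that feature density means the fraction of tokens on which a latent is non-zero, which the caption itself does not spell out. The function names and the toy activation matrix are hypothetical.

```python
import numpy as np

def feature_density(latents):
    """Fraction of tokens on which each latent is active (non-zero).
    Assumed definition; rows are tokens, columns are latents."""
    return (latents != 0).mean(axis=0)

def mean_activation(latents):
    """Sum of activations over all tokens divided by the number of tokens,
    as described in the Figure 3 caption."""
    return latents.sum(axis=0) / latents.shape[0]

# Toy activations: 4 tokens, 3 latents.
latents = np.array([
    [0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 0.0],
    [0.0, 4.0, 0.0],
])

density = feature_density(latents)    # [0.25, 1.0, 0.0]
mean_act = mean_activation(latents)   # [0.75, 2.0, 0.0]
```

In this toy example the second latent fires on every token (a high-density feature) and the third never fires (a dead or very low-density feature), the two extremes the caption contrasts.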