Transcoders Beat Sparse Autoencoders for Interpretability
Gonçalo Paulo, Stepan Shabalin, Nora Belrose
TL;DR
This paper compares transcoders with sparse autoencoders (SAEs) as tools for interpreting internal model components, and introduces skip transcoders. Across multiple model families, skip transcoders yield more interpretable features than SAEs at lower reconstruction loss, which makes them better suited to circuit analysis. Evaluations with automated interpretability metrics and SAEBench show that skip transcoders Pareto-dominate SAEs, supporting a shift in focus toward transcoders for mechanistic interpretability. Open-source code and practical guidance are provided to support adoption and further study.
Abstract
Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher-dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
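
The three objectives described above differ mainly in the reconstruction target and in the optional skip path. The following is a minimal sketch, not the authors' implementation: the TopK sparsity rule, the random untrained weights, the toy dimensions, and the names `forward`, `topk_relu`, and `toy_component` are all assumptions made for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]

def topk_relu(z, k):
    """Keep the k largest pre-activations (after ReLU) and zero the rest.
    TopK is an assumed sparsity rule; the abstract does not name one."""
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [max(z[i], 0.0) if i in keep else 0.0 for i in range(len(z))]

def forward(x, W_enc, b_enc, W_dec, b_dec, k, W_skip=None):
    """Encode x into sparse latents and decode them.
    Passing W_skip adds a linear path from the input to the output, which
    together with b_dec forms the affine skip connection."""
    latents = topk_relu(add(matvec(W_enc, x), b_enc), k)
    out = add(matvec(W_dec, latents), b_dec)
    if W_skip is not None:
        out = add(out, matvec(W_skip, x))
    return out, latents

def mse(pred, target):
    """Mean squared error between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def toy_component(x):
    """Hypothetical stand-in for a network component such as an MLP block."""
    return [max(2.0 * v - 0.5, 0.0) for v in x]

rng = random.Random(0)
d, n_latents, k = 4, 16, 3  # toy sizes, chosen only for illustration

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_enc, W_dec, W_skip = rand_matrix(n_latents, d), rand_matrix(d, n_latents), rand_matrix(d, d)
b_enc, b_dec = [0.0] * n_latents, [0.0] * d

x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # activations entering the component
y = toy_component(x)                         # activations leaving the component

recon, latents = forward(x, W_enc, b_enc, W_dec, b_dec, k)
sae_loss = mse(recon, x)         # SAE: reconstruct the same activations it encodes
transcoder_loss = mse(recon, y)  # transcoder: predict the component's output

skip_out, skip_latents = forward(x, W_enc, b_enc, W_dec, b_dec, k, W_skip=W_skip)
skip_loss = mse(skip_out, y)     # skip transcoder: same target, extra skip path
```

In this sketch the sparse latents are identical with and without the skip term, since the skip path only adds a linear function of the input to the output.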
