Transcoders Find Interpretable LLM Feature Circuits

Jacob Dunefsky; Philippe Chlenski; Neel Nanda

TL;DR

The paper addresses a core obstacle to mechanistic interpretability of transformer MLP sublayers: neurons are polysemantic, so neuron-level analysis is hard to interpret. It introduces transcoders, which are wide, sparsely activating replacements for MLP sublayers, trained to reproduce the MLP's output from the MLP's input under a sparsity penalty. The authors develop a circuit-analysis method whose attributions split into an input-invariant term and an input-dependent term. The method covers attribution between pairs of transcoder features and through attention OV circuits, and it includes a greedy path-finding algorithm that assembles computational subgraphs. On GPT2-small, Pythia-410M, and Pythia-1.4B, transcoders perform at least on par with SAEs in sparsity and faithfulness. Case studies include blind reverse-engineering of unknown circuits and an analysis of the GPT2-small greater-than circuit, where de-embeddings provide novel insights. The work presents transcoders as a scalable, interpretable tool for dissecting MLP computations in transformers, with potential implications for debugging and controlling model behavior.

Abstract

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. We then successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the "greater-than circuit" in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits/.
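As a rough illustration of the ideas in the abstract, the following NumPy sketch shows a transcoder as a wide one-hidden-layer ReLU network, a training loss that combines faithfulness to the MLP output with an L1 sparsity penalty, and an attribution between two transcoder features that factors into an input-invariant weight term and an input-dependent activation term. All names here (`init_transcoder`, `transcoder_forward`, `transcoder_loss`, `pairwise_attribution`, `l1_coeff`) are hypothetical and are not taken from the authors' repository. The sketch also assumes the only path between the two layers is the direct residual-stream path, ignoring attention and normalization.

```python
import numpy as np


def init_transcoder(d_model, d_features, seed=0):
    """Random parameters for a toy transcoder (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, (d_features, d_model)),
        "b_enc": np.zeros(d_features),
        "W_dec": rng.normal(0.0, scale, (d_model, d_features)),
        "b_dec": np.zeros(d_model),
    }


def transcoder_forward(params, x):
    """Map MLP inputs x of shape (batch, d_model) to feature activations
    z of shape (batch, d_features) and a predicted MLP output."""
    z = np.maximum(0.0, x @ params["W_enc"].T + params["b_enc"])
    y_hat = z @ params["W_dec"].T + params["b_dec"]
    return z, y_hat


def transcoder_loss(params, x, mlp_out, l1_coeff):
    """Faithfulness (squared error against the real MLP output)
    plus an L1 penalty on feature activations."""
    z, y_hat = transcoder_forward(params, x)
    mse = np.mean(np.sum((y_hat - mlp_out) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(z), axis=-1))
    return mse + l1_coeff * l1


def pairwise_attribution(z_early, W_dec_early, W_enc_late):
    """Attribution of each earlier-layer feature to each later-layer
    feature's pre-activation, assuming a direct residual-stream path.

    z_early:      (n_early,) activations on one input (input-dependent)
    W_dec_early:  (d_model, n_early) decoder of the earlier transcoder
    W_enc_late:   (n_late, d_model) encoder of the later transcoder
    """
    # Input-invariant part: depends only on the transcoder weights.
    invariant = W_enc_late @ W_dec_early  # (n_late, n_early)
    # Input-dependent part: scale each column by the feature activation.
    return invariant * z_early[None, :]
```

Because `invariant` is computed from weights alone, it can be inspected once and reused for every input; only the scaling by `z_early` changes from prompt to prompt.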

Paper Structure (62 sections, 21 equations, 6 figures, 4 tables, 2 algorithms)

Introduction
Our contributions.
Transformers preliminaries
Transcoders
Architecture and training
Circuit analysis with transcoders
Attribution between transcoder feature pairs
Attribution through attention heads
Finding computational subgraphs
De-embeddings: a special case of input-invariant information
Comparison with SAEs
Blind interpretability comparison of transcoders to SAEs
Quantitative comparison of transcoders to SAEs
Evaluation metrics
Results
...and 47 more sections

Figures (6)

Figure 1: A comparison between SAEs, MLP transcoders, and MLP sublayers for a transformer-based language model. SAEs learn to reconstruct model activations, whereas transcoders imitate sublayers' input-output behavior.
Figure 2: A visualization of the circuit-finding algorithm.
Figure 3: The sparsity-accuracy tradeoff of transcoders versus SAEs on GPT2-small, Pythia-410M, and Pythia-1.4B. Each point corresponds to a trained SAE or transcoder, and is labeled with the L1 regularization penalty $\lambda_1$ used during training.
Figure 4: Left: Performance according to the probability difference metric when all but the top $k$ features or neurons in MLP10 are zero-ablated. Right: The DLA and de-embedding score for tc10[5315], which contributed negatively to the transcoder's performance.
Figure 5: For the three MLP10 transcoder features with the highest activation variance over the "greater-than" dataset, and for every possible YY token, we plot the direct logit attribution (the extent to which the feature boosts the output probability of YY) and the de-embedding score (an input-invariant measurement of how much YY causes the feature to fire).
...and 1 more figure
