Partially Rewriting a Transformer in Natural Language
Gonçalo Paulo, Nora Belrose
TL;DR
The paper tests whether part of a transformer can be rewritten using natural language explanations of interpretable latent features. It trains a sparse transcoder to approximate one layer's MLP, then replaces the transcoder's first layer with an LLM-based simulator that predicts each latent's activation from its explanation and the surrounding context, using quantile normalization to calibrate the predictions. The resulting increase in the model's loss is statistically similar to that from replacing the sparse MLP's output with the zero vector, which indicates that current explanations are not precise enough to preserve performance. Repeating the protocol with a sparse autoencoder on the residual stream of the same layer gives similar results. The work highlights the need for more detailed, contrastive, and calibrated explanations before latent-level rewrites of large language models can be faithful.
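The calibration step can be illustrated with a minimal sketch of rank-based quantile normalization. This is one plausible reading of the term, not the authors' implementation: the function name `quantile_normalize`, the `reference` sample of real activations, and the toy numbers are all assumptions made for illustration. Each simulator score is replaced by the reference activation at the same empirical quantile, so the calibrated predictions inherit the sparsity of the real latent.

```python
def quantile_normalize(scores, reference):
    """Map each simulator score to the reference value at the same quantile.

    scores:    raw simulator predictions for one latent (hypothetical input)
    reference: non-empty sample of that latent's real activations (hypothetical)
    """
    ref = sorted(reference)
    n = len(scores)
    # Indices of the scores from smallest to largest; ties keep input order.
    order = sorted(range(n), key=lambda i: scores[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Empirical quantile of this score within the batch, in [0, 1].
        q = rank / (n - 1) if n > 1 else 0.0
        # Nearest reference value at that quantile.
        out[i] = ref[round(q * (len(ref) - 1))]
    return out


# Toy example: the real latent is zero most of the time.
raw_scores = [0.2, 0.9, 0.5]
real_activations = [0.0, 0.0, 0.0, 0.0, 3.0]
calibrated = quantile_normalize(raw_scores, real_activations)
```

In this toy case only the highest-scoring token keeps a nonzero activation, because the reference distribution is zero at every quantile below the top one.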
Abstract
The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to partially rewrite a large language model using simple natural language explanations. We first approximate one of the feedforward networks in the LLM with a wider MLP with sparsely activating neurons - a transcoder - and use an automated interpretability pipeline to generate explanations for these neurons. We then replace the first layer of this sparse MLP with an LLM-based simulator, which predicts the activation of each neuron given its explanation and the surrounding context. Finally, we measure the degree to which these modifications distort the model's final output. With our pipeline, the model's increase in loss is statistically similar to entirely replacing the sparse MLP output with the zero vector. We employ the same protocol, this time using a sparse autoencoder, on the residual stream of the same layer and obtain similar results. These results suggest that more detailed explanations are needed to improve performance substantially above the zero ablation baseline.
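The substitution described in the abstract can be sketched on a toy scale. The code below is a sketch under stated assumptions, not the paper's pipeline: the decoder weights, the latent values, and the helper names `decode` and `squared_error` are hypothetical, and output reconstruction error stands in for the language model's loss, which is what the paper actually measures. The second layer of the sparse MLP is kept, its inputs are swapped from the encoder's latents to simulated ones, and the result is compared with the zero-vector baseline.

```python
def decode(latents, w_dec, b_dec):
    """Second layer of the sparse MLP: linear map from latents to the output."""
    out = list(b_dec)
    for z, row in zip(latents, w_dec):
        if z != 0.0:  # most latents are inactive, so skip them
            for j, w in enumerate(row):
                out[j] += z * w
    return out


def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical decoder with 3 latents and a 2-dimensional output.
w_dec = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b_dec = [0.0, 0.0]

true_latents = [0.0, 1.5, 0.0]       # what the trained encoder would produce
simulated_latents = [0.0, 1.0, 0.5]  # what an explanation-driven simulator might predict

reference = decode(true_latents, w_dec, b_dec)       # unmodified transcoder output
rewritten = decode(simulated_latents, w_dec, b_dec)  # first layer replaced by the simulator
zero_ablated = [0.0] * len(b_dec)                    # baseline: output replaced by zeros

rewrite_error = squared_error(rewritten, reference)
zero_error = squared_error(zero_ablated, reference)
```

The toy values are made up and do not reproduce the paper's finding. In the reported experiments the rewritten model's loss increase is statistically similar to the zero-ablation baseline, whereas here the simulated latents were chosen only to show how the two quantities are computed.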
