Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O'Neill, Thang Bui

TL;DR

This work tackles scalable mechanistic interpretability for large language models by training discrete sparse autoencoders (SAEs) on task-specific positive and negative examples to identify circuits implemented by attention heads. By discretising head activations into integer codes, the method directly flags the heads and head pairs that implement circuit-specific computations, enabling node-level and edge-level circuit identification from only 5–10 example prompts. On the IOI, Greater-than, and Docstring tasks, the SAE-based approach achieves precision and recall higher than or comparable to state-of-the-art baselines while reducing runtime from hours to seconds, and it remains robust to hyperparameter choices and dataset size. The identified circuits often match or exceed full-model performance on target metrics, demonstrating the potential for scalable, interpretable mechanistic analysis without extensive ablations or architectural changes.

Abstract

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three well-studied tasks - indirect object identification, greater-than comparisons, and docstring completion - the proposed method achieves higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, we require only 5-10 text examples for each task to learn robust representations. Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analysing the inner workings of large language models.
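
The node-level procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`discretise`, `node_scores`, `select_heads`), the toy latents, and the threshold value of 0.5 are all hypothetical, and the sparse autoencoder itself is not trained here. The sketch takes already-encoded activations of shape (examples, heads, features), takes the argmax over features to get one integer code per head per example, counts the codes that occur only in positive examples, applies a softmax across heads, and keeps the heads above the threshold.

```python
import numpy as np

def discretise(latents: np.ndarray) -> np.ndarray:
    """Map SAE activations of shape (b, h, n) to integer codes of shape (b, h)."""
    return latents.argmax(axis=-1)

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(x - x.max())
    return shifted / shifted.sum()

def node_scores(codes: np.ndarray, is_positive: np.ndarray) -> np.ndarray:
    """Per head, count codes seen in positive examples only, then softmax over heads."""
    n_heads = codes.shape[1]
    counts = np.zeros(n_heads)
    for head in range(n_heads):
        pos_codes = set(codes[is_positive, head].tolist())
        neg_codes = set(codes[~is_positive, head].tolist())
        counts[head] = len(pos_codes - neg_codes)
    return softmax(counts)

def select_heads(scores: np.ndarray, theta: float) -> np.ndarray:
    """Indices of heads whose normalised score surpasses the threshold theta."""
    return np.flatnonzero(scores > theta)

# Toy data (hypothetical): 4 examples, 3 heads, 4 SAE features.
# Head 0 uses different codes on positive and negative examples;
# heads 1 and 2 use the same code everywhere.
toy_codes = np.array([[0, 1, 2],
                      [3, 1, 2],
                      [1, 1, 2],
                      [2, 1, 2]])
toy_latents = np.eye(4)[toy_codes]            # one-hot latents, shape (4, 3, 4)
is_positive = np.array([True, True, False, False])

codes = discretise(toy_latents)
scores = node_scores(codes, is_positive)
circuit_heads = select_heads(scores, theta=0.5)
print(circuit_heads)
```

In the toy data only head 0 has codes unique to the positive examples, so it is the only head selected. With real activations the threshold would need to be chosen per task rather than fixed at the illustrative value used here.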

Paper Structure (66 sections, 13 equations, 36 figures, 7 tables)

Figures (36)

  • Figure 1: After training the sparse autoencoder, we obtain discrete representations $\mathbf{z}$ by passing the tensor $\mathbf{x}$ through it and taking the argmax over the feature dimension, obtaining an integer code for each head in each example in $\hat{\mathbf{z}}$. Here $b$ is the number of examples, $h$ is the number of heads, $d$ is the transformer hidden dimension, and $n$ is the number of learned features. For node-level circuit identification, shown here, we compute the number of codes unique to positive examples per head, normalise with a softmax, choose a threshold $\theta$, and identify a head as being in the circuit if its score surpasses the threshold. For edge-level circuit identification, shown in Figure \ref{fig:edge_method}, we count the co-occurrences of codes between heads for the top-$k$ co-occurrences, and then again apply the softmax and threshold with $\theta$.
  • Figure 2: Comparison of our method's performance against state-of-the-art circuit identification techniques (ACDC, HISP, and SP) on three well-studied transformer circuits: Docstring, Greater-than, and Indirect Object Identification (IOI). The bar plots show the average AUC (Area Under the ROC Curve) scores for each method, averaged across KL divergence and loss metrics, for both edge-level and node-level circuit identification. Error bars for Ours represent the standard deviation of AUC scores across 5 runs. Our method consistently outperforms or matches the performance of existing techniques across all tasks.
  • Figure 3: Mean ROC AUC scores across different values of the number of SAE features and sparsity penalty $\lambda$.
  • Figure 4: F1 score (node-level) for each dataset given a threshold $\theta$ for selecting a head's importance (after softmax). The optimal threshold is approximately the same for both IOI and Greater-than tasks.
  • Figure 5: Faithfulness of our learned circuits, circuits from edge attribution patching (EAP), and randomly selected circuits of equivalent size for the (a) IOI and (b) Greater-than tasks. Our circuits quickly approach or surpass the full model's performance as attention heads are added in order of importance. We outperform or match the performance of EAP at all thresholds for all metrics. Faithfulness of 1 indicates complete agreement with the unablated model.
  • ...and 31 more figures
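
The threshold sweep behind Figure 4 can be illustrated with a short sketch. It assumes a vector of softmax-normalised head scores and a boolean ground-truth mask marking which heads belong to the circuit. The scores, mask, and candidate thresholds below are made-up toy values, and `f1_at_threshold` and `sweep` are hypothetical helper names rather than anything from the paper.

```python
import numpy as np

def f1_at_threshold(scores: np.ndarray, in_circuit: np.ndarray, theta: float) -> float:
    """Node-level F1 when heads with score > theta are predicted to be in the circuit."""
    predicted = scores > theta
    tp = int(np.sum(predicted & in_circuit))
    fp = int(np.sum(predicted & ~in_circuit))
    fn = int(np.sum(~predicted & in_circuit))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def sweep(scores: np.ndarray, in_circuit: np.ndarray, thetas) -> list:
    """Return (theta, F1) pairs for each candidate threshold."""
    return [(float(t), f1_at_threshold(scores, in_circuit, t)) for t in thetas]

# Toy values (hypothetical): four heads, the first two are in the ground-truth circuit.
toy_scores = np.array([0.5, 0.3, 0.15, 0.05])
toy_truth = np.array([True, True, False, False])

results = sweep(toy_scores, toy_truth, thetas=[0.1, 0.2, 0.4])
best_theta, best_f1 = max(results, key=lambda pair: pair[1])
print(results)
print(best_theta, best_f1)
```

A threshold that is too low admits heads outside the circuit and lowers precision, while one that is too high drops true circuit heads and lowers recall, which is why the F1 curve peaks at an intermediate value.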