A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

TL;DR

The paper surveys Sparse Autoencoders (SAEs) as a promising approach to mechanistic interpretability in large language models (LLMs), focusing on disentangling polysemantic activations into monosemantic features. It details the technical framework of SAEs (encoder–decoder with an overcomplete, sparse activation) and presents a taxonomy of architectural and training-strategy variants. It reviews explainability analyses, separating input-based and output-based explanations, and outlines a dual-criteria evaluation framework combining structural fidelity with functional interpretability and robustness. It also discusses real-world applications in model steering and behavior analysis, and candidly addresses challenges such as data and compute demands, incomplete concept dictionaries, and the need for stronger theoretical foundations. It concludes that while SAEs enable clearer insights into LLM internals, further methodological advances are needed for scalable and reliable deployment.

Abstract

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.

Paper Structure

This paper contains 30 sections, 29 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (a) The fundamental framework of a Sparse Autoencoder (SAE). The SAE is trained to take a model representation $\mathbf{z}$ as input and project it to an overcomplete sparse activation $h(\mathbf{z})$ by learning to produce a reconstruction $\hat{\mathbf{z}}$ of the original input. The SAE typically comprises an encoder, a decoder, and a loss function for training. (b) The development of SAEs progresses through multiple stages. Note that this timeline lists only some representative SAE models rather than providing an exhaustive compilation.
  • Figure 2: The figure illustrates the interpretation of a learned SAE feature using VocabProj and MaxAct. VocabProj lists the words with the highest logits in the "Positive Logits" column and those with the lowest logits in the "Negative Logits" column. The upper histogram in Statistical Analysis shows the distribution of randomly sampled non-zero activations, with the y-axis representing the number of sampled activations and the x-axis indicating activation scores. The lower histogram depicts the logit density, where the y-axis represents the number of tokens and the x-axis corresponds to logit scores. MaxAct highlights tokens in an input text that strongly activate the learned feature. The figure references the Neuronpedia website.
  • Figure 3: The figure illustrates the process of using an SAE to steer the behavior of an LLM, with an example of the resulting steered output. In part (a), an SAE is typically used to extract a steering vector by comparing two representations: $\mathbf{z}$, which lacks a certain feature, and $\mathbf{z'}$, which contains that feature. In part (b), this steering vector is added to the input representation, modifying the LLM's behavior to align with the desired feature. Part (c) shows example results of this process, where the steered output reflects the steered feature even when the original input prompt is neutral or contradictory to the feature being introduced.
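
The captions for Figures 1(a) and 3 describe a pipeline that can be illustrated with a small toy sketch. The code below is not the survey's implementation. It assumes a ReLU encoder, a linear decoder, and a squared-error reconstruction loss with an L1 sparsity penalty, which is one common SAE formulation among the variants the survey covers. All names (`encode`, `decode`, `sae_loss`, `steering_feature`, `steer`, `l1_coeff`, `alpha`) and the hand-picked weights are hypothetical. For the steering step it assumes that the steering vector is the decoder direction of the feature whose activation rises most from $\mathbf{z}$ to $\mathbf{z'}$, which is one possible reading of Figure 3(a).

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def encode(z, W_enc, b_enc):
    """Sparse activation h(z) = ReLU(W_enc z + b_enc); h is wider than z."""
    pre = matvec(W_enc, z)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def decode(h, W_dec, b_dec):
    """Reconstruction z_hat = W_dec h + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, h), b_dec)]


def sae_loss(z, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on h (assumed loss form)."""
    h = encode(z, W_enc, b_enc)
    z_hat = decode(h, W_dec, b_dec)
    recon = sum((a - b) ** 2 for a, b in zip(z, z_hat))
    sparsity = sum(abs(x) for x in h)
    return recon + l1_coeff * sparsity


def steering_feature(z, z_prime, W_enc, b_enc):
    """Index of the feature whose activation increases most from z to z_prime."""
    h = encode(z, W_enc, b_enc)
    h_prime = encode(z_prime, W_enc, b_enc)
    delta = [b - a for a, b in zip(h, h_prime)]
    return max(range(len(delta)), key=delta.__getitem__)


def steer(z, W_dec, feature_idx, alpha):
    """Add alpha times the decoder direction (a column of W_dec) of one feature to z."""
    direction = [row[feature_idx] for row in W_dec]
    return [x + alpha * d for x, d in zip(z, direction)]


# Toy weights chosen by hand: a 2-dimensional representation, 4 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.0, 0.0]

if __name__ == "__main__":
    z = [1.0, 0.0]          # representation lacking the feature
    z_prime = [1.0, 3.0]    # representation containing the feature
    idx = steering_feature(z, z_prime, W_ENC, B_ENC)
    print(idx, steer(z, W_DEC, idx, alpha=2.0))
```

In a real setting the weights would be learned by minimizing the loss over many LLM activations, and the steered representation would be passed back into the model's forward pass. The toy weights here only make the arithmetic easy to follow by hand.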