A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du
TL;DR
The paper surveys Sparse Autoencoders (SAEs) as a promising approach to mechanistic interpretability in large language models (LLMs), focusing on how they disentangle polysemantic activations into monosemantic features. It details the technical framework of SAEs (an encoder–decoder with an overcomplete, sparsely activated hidden layer) and presents a taxonomy of architectural and training-strategy variants. It reviews methods for explaining SAE features, separating input-based from output-based explanations, and outlines a dual-criteria evaluation framework that combines structural fidelity with functional interpretability and robustness. It also discusses real-world applications in model steering and behavior analysis, and addresses open challenges such as data and compute demands, incomplete concept dictionaries, and the need for stronger theoretical foundations. It concludes that while SAEs give clearer insight into LLM internals, further methodological advances are needed for scalable and reliable deployment.
Abstract
Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.
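To make the "basic architecture" mentioned above concrete, the following is a minimal sketch of one forward pass through a sparse autoencoder. It is written in pure Python for illustration and is not taken from the survey. It assumes the common formulation with a ReLU encoder, a linear decoder, and an L1 sparsity penalty; the names `sae_forward`, `sae_loss`, `W_enc`, `W_dec`, and `l1_coeff`, as well as the dimensions used, are hypothetical.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation x into a wider latent z, then reconstruct it."""
    # Encoder: project into the overcomplete latent space; ReLU zeroes out
    # negative pre-activations, so some latent units are exactly inactive.
    pre = matvec(W_enc, x)
    z = [max(0.0, p + b) for p, b in zip(pre, b_enc)]
    # Decoder: rebuild the activation as a linear combination of latents.
    rec = matvec(W_dec, z)
    x_hat = [r + b for r, b in zip(rec, b_dec)]
    return z, x_hat


def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty on the latent (assumed)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return recon + l1_coeff * sparsity


# Toy usage with made-up sizes: the latent is wider than the input (overcomplete).
random.seed(0)
d_model, d_sae = 4, 16
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [0.0] * d_sae
W_dec = [[random.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_model)]
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
```

In this sketch the weights are random, so the latent `z` is not yet meaningfully sparse or interpretable; training would minimize `sae_loss` over many LLM activations so that each input is reconstructed from only a few active latent units.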
