Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
TL;DR
This work supports mechanistic interpretability of large language models by training a suite of 256 Sparse Autoencoders (SAEs) on every layer and sublayer of Llama-3.1-8B-Base to extract sparse, interpretable features. It introduces three modifications to Top-K SAEs (decoder-norm-aware Top-K, JumpReLU post-processing, and a K-annealing schedule) and demonstrates disk-I/O-friendly training via mixed parallelism. The evaluation shows that Top-K SAEs consistently improve sparsity without sacrificing reconstruction quality, while wider SAEs tend to be more Pareto-efficient and can uncover genuinely new features. Interpretability analyses using GPT-4o and manual review reveal a generally uniform feature geometry, with some ultra-rare or non-interpretable features. The project releases open-source SAE checkpoints and tooling so that other researchers can study large-scale models without retraining SAEs themselves.
Abstract
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
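
To make the decoder-norm-aware Top-K idea named in the TL;DR more concrete, the following is a minimal sketch in plain Python. It rests on assumptions the text above does not spell out: that features are ranked by activation multiplied by the L2 norm of their decoder row, that a ReLU is applied before selection, and that the kept activations themselves are left unscaled. The function name `topk_sae_forward` and the argument layout are hypothetical and are not taken from the released Llama Scope code.

```python
import math


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Illustrative Top-K SAE forward pass with norm-aware feature selection.

    x      : input activation vector, length d
    W_enc  : one encoder row of length d per feature
    b_enc  : encoder bias, one value per feature
    W_dec  : one decoder row of length d per feature
    b_dec  : decoder bias, length d
    k      : number of features kept active
    """
    n_features = len(W_enc)
    d = len(x)

    # Encoder pre-activations, passed through a ReLU (an assumption of this sketch).
    pre = [
        max(0.0, sum(w * xi for w, xi in zip(W_enc[i], x)) + b_enc[i])
        for i in range(n_features)
    ]

    # L2 norm of each feature's decoder row.
    norms = [math.sqrt(sum(w * w for w in W_dec[i])) for i in range(n_features)]

    # Assumed selection rule: rank by activation * decoder norm, so a feature
    # that writes a large vector into the reconstruction is not under-ranked.
    scores = [p * n for p, n in zip(pre, norms)]
    keep = set(sorted(range(n_features), key=lambda i: scores[i], reverse=True)[:k])

    # Keep the original activations of the selected features; zero the rest.
    acts = [pre[i] if i in keep else 0.0 for i in range(n_features)]

    # Linear decoder: bias plus the weighted sum of active decoder rows.
    x_hat = [
        b_dec[j] + sum(acts[i] * W_dec[i][j] for i in range(n_features))
        for j in range(d)
    ]
    return acts, x_hat
```

As a toy illustration, take `x = [1.0, 0.0]`, encoder rows `[1, 0]`, `[0.9, 0]`, `[0, 1]`, decoder rows `[1, 0]`, `[2, 0]`, `[0, 1]`, zero biases, and `k = 1`. Ranking on raw activations alone would keep feature 0, whereas the norm-aware ranking in this sketch keeps feature 1, because its decoder row has twice the norm.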
