Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

TL;DR

This work tackles mechanistic interpretability in large language models by training a suite of 256 Sparse Autoencoders (SAEs) across every layer and sublayer of Llama-3.1-8B-Base to extract sparse, interpretable features. It introduces three modifications to Top-K SAEs (decoder-norm-aware Top-K, JumpReLU post-processing, and a K-annealing schedule) and demonstrates disk-I/O-friendly training via mixed parallelism. The evaluation shows that Top-K SAEs consistently improve sparsity without sacrificing reconstruction quality, while wider SAEs tend to yield better Pareto efficiency and can uncover genuinely new features. Interpretability analyses using GPT-4o and manual review reveal a generally uniform feature geometry, with some ultra-rare or non-interpretable cases. The project provides open-source SAE checkpoints and tooling, enabling broader adoption and accelerating mechanistic interpretability research in large-scale models.

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
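
To make the Top-K SAE variant named in the abstract more concrete, the following is a minimal pure-Python sketch of a generic Top-K forward pass: encode an activation vector, keep only the k largest pre-activations, and decode from those. It is an illustration under stated assumptions, not the Llama Scope implementation. The function names, argument names, and the ReLU applied to the kept values are all hypothetical choices, and the paper's modifications (decoder-norm-aware Top-K, JumpReLU post-processing, K-annealing) are not shown.

```python
def top_k_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Generic Top-K SAE forward pass (illustrative names, not from the paper).

    x:     list of d floats (one model activation)
    w_enc: n rows of d floats (encoder weights); b_enc: n floats
    w_dec: n rows of d floats (one decoder direction per latent); b_dec: d floats
    Returns (latents, reconstruction).
    """
    # Encoder pre-activations, one per latent.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    # Indices of the k largest pre-activations.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # Top-K activation: every latent outside the top k is set to zero.
    latents = [0.0] * len(pre)
    for i in keep:
        latents[i] = max(pre[i], 0.0)  # ReLU on kept values (an assumption)
    # Decode: decoder bias plus the weighted sum of the active decoder rows.
    recon = list(b_dec)
    for i in keep:
        for j in range(len(recon)):
            recon[j] += latents[i] * w_dec[i][j]
    return latents, recon


def l0(latents):
    """L0 sparsity: the number of nonzero latents."""
    return sum(1 for z in latents if z != 0.0)


# Tiny worked example: 2-dimensional input, 3 latents, k = 1.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents, recon = top_k_sae_forward(
    [1.0, 2.0], weights, [0.0, 0.0, 0.0], weights, [0.0, 0.0], k=1
)
print(latents, recon, l0(latents))
```

In this toy setup only the third latent survives the Top-K step, so the L0 sparsity is bounded by k by construction. That bound is what lets Top-K SAEs be compared at a fixed sparsity level.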

Paper Structure

This paper contains 52 sections, 9 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Four potential training positions in one Transformer Block.
  • Figure 2: Explained Variance (upper) and Delta LM loss (lower) over L0 sparsity for SAEs trained on L7R, L15R and L23R.
  • Figure 3: Automatically labeled monosemanticity scores of L15R-8x SAE features.
  • Figure 4: Firing frequency of L7R-8x, L15R-8x and L23R-8x TopK SAEs.
  • Figure 5: SAE performance on long context data, measured by MSE and L0 sparsity.
  • ...and 7 more figures