Route Sparse Autoencoder to Interpret Large Language Models

Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He

TL;DR

This work addresses the limited ability of traditional sparse autoencoders to capture features that span multiple layers in LLMs. It introduces RouteSAE, a routing-enabled, shared TopK SAE that dynamically weights multi-layer activations and feeds them into a unified feature space, improving interpretability with minimal parameter overhead. Empirical results on Llama-3.2-1B-Instruct show significant gains in both the number of interpretable features and interpretability scores, along with robust feature steering capabilities. The approach advances mechanistic interpretability by enabling scalable, cross-layer feature discovery and targeted interventions in LLM activations.

Abstract

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior work primarily focuses on feature extraction from a single layer and fails to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our code is available at https://github.com/swei2001/RouteSAEs.
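
The routing idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not taken from the authors' repository: the soft (weighted-sum) routing, the sum pooling fed to the router, the omission of bias terms, the toy dimensions, and every name used here (`route`, `topk_sae`, `router_w`, `w_enc`, `w_dec`) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def route(layer_acts, router_w):
    """Assumed soft routing: score each layer from the summed activations,
    then mix the per-layer activations with the softmax weights."""
    d = len(layer_acts[0])
    pooled = [sum(act[i] for act in layer_acts) for i in range(d)]
    weights = softmax(matvec(router_w, pooled))  # one weight per layer
    mixed = [sum(w * act[i] for w, act in zip(weights, layer_acts))
             for i in range(d)]
    return weights, mixed


def topk_sae(x, w_enc, w_dec, k):
    """Shared TopK SAE without biases (a simplification): keep the k
    largest pre-activations, zero the rest, then decode linearly."""
    pre = matvec(w_enc, x)
    keep = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for j in keep:
        latents[j] = pre[j]
    recon = [sum(latents[j] * w_dec[j][i] for j in keep)
             for i in range(len(x))]
    return latents, recon


def rand_matrix(rows, cols):
    """Small random matrix standing in for real activations or weights."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_layers, d_model, n_latents, k = 4, 8, 32, 4  # toy sizes, not the paper's

layer_acts = rand_matrix(n_layers, d_model)  # one activation vector per layer
router_w = rand_matrix(n_layers, d_model)    # the only routing-specific weights
w_enc = rand_matrix(n_latents, d_model)      # encoder shared across layers
w_dec = rand_matrix(n_latents, d_model)      # decoder shared across layers

weights, mixed = route(layer_acts, router_w)
latents, recon = topk_sae(mixed, w_enc, w_dec, k)
print("layer weights:", [round(w, 3) for w in weights])
print("active latents:", sum(1 for z in latents if z != 0.0))
```

In this sketch the router adds only `n_layers * d_model` parameters on top of a single shared encoder and decoder, which is one way to read the abstract's claim of minimal parameter overhead.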

Paper Structure

This paper contains 22 sections, 13 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Layer-wise normalized activation values for two features extracted by a TopK SAE in pythia-160m. The low-level feature (visual media terms) exhibits high activation in early layers that gradually decreases in deeper layers. In contrast, the high-level feature (temporal expressions) shows increasing activation with depth, peaking in the later layers.
  • Figure 2: Comparison of vanilla single-layer SAE, Crosscoder, and RouteSAE. Most existing SAEs belong to the vanilla SAE category, where features are extracted from the activation of a single layer. Crosscoder relies on separate encoders and decoders for each layer. RouteSAE incorporates a lightweight router to dynamically integrate multi-layer residual stream activations.
  • Figure 3: RouteSAE employs a lightweight router to dynamically integrate activations from multiple residual stream layers, effectively disentangling them into a shared feature space. It enables the model to capture features across different layers: low-level features such as "units of weight" and "Olympics" from shallow layers, and high-level features like "more [X] than [Y]" and "do everything [possible/in my power]" from deeper layers.
  • Figure 4: Pareto frontier of sparsity versus KL divergence. RouteSAE achieves a lower KL divergence at the same sparsity level.
  • Figure 5: Effect of threshold on feature interpretability in RouteSAE. (a) Increasing the threshold reduces the number of selected features. (b) Higher thresholds yield better interpretation scores across sparsity levels.
  • ...and 7 more figures