Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

TL;DR

Scale SAE tackles the interpretability-efficiency gap in LLM analysis by addressing polysemanticity with a diverse, specialized MoE-based sparse autoencoder. It introduces Multiple Expert Activation to promote specialization across experts and Feature Scaling to amplify high-frequency components for richer, monosemantic features. Empirical results show up to a 24% reduction in reconstruction error and a 99% decrease in feature redundancy compared with prior MoE-SAE methods, along with improved automated interpretability. This approach enables transparent inspection of LLM activations under a computationally feasible framework, advancing mechanistic interpretability for large language models.

Abstract

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM behavior, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a critical limitation in MoE-SAE: experts often fail to specialize, and frequently learn overlapping or identical features. To address this, we propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling. Experiments demonstrate a 24% lower reconstruction error and a 99% reduction in feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility.
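
As a rough illustration of the gated routing described in the abstract, the sketch below runs one forward pass of an MoE-style SAE in which several experts are active at once and a single Top-K is taken over their combined activations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `moe_sae_forward`, the softmax router, the gate-weighted ReLU encoder, and all tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np

def moe_sae_forward(x, router_w, enc_w, enc_b, dec_w, dec_b, n_active, k):
    """One forward pass of a hypothetical multi-expert SAE.

    x:        (d,)       input activation
    router_w: (d, E)     router weights (assumed linear + softmax)
    enc_w:    (E, d, m)  per-expert encoder weights
    enc_b:    (E, m)     per-expert encoder biases
    dec_w:    (E, m, d)  per-expert decoder weights
    dec_b:    (d,)       shared decoder bias
    """
    n_experts, _, width = enc_w.shape
    centered = x - dec_b

    # Router: softmax over experts, then keep the n_active highest gates.
    logits = centered @ router_w
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    active = np.argsort(gates)[-n_active:]

    # Encode only with the selected experts, weighting each by its gate.
    z = np.zeros((n_experts, width))
    for e in active:
        pre = centered @ enc_w[e] + enc_b[e]
        z[e] = gates[e] * np.maximum(pre, 0.0)

    # Global Top-K: keep the k largest activations across all active experts.
    flat = z.reshape(-1).copy()
    if k < flat.size:
        flat[np.argsort(flat)[:-k]] = 0.0
    z = flat.reshape(n_experts, width)

    # Decode from the surviving features of the active experts.
    x_hat = dec_b + sum(z[e] @ dec_w[e] for e in active)
    return x_hat, z
```

Taking the Top-K over the pooled activations, rather than per expert, lets the experts compete for the sparsity budget; an expert whose features do not fit the input contributes nothing even though it was routed to.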

Paper Structure

This paper contains 29 sections, 8 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Scale Sparse Autoencoder Architecture. An illustration of the three core mechanisms in the Scale SAE architecture. (a) Multiple Expert Activation. A router selects a subset of experts (e.g., 2 out of 3 shown) to process each input. (b) Global Top-K Activation. The activations from the selected experts are aggregated, and a global Top-K operation (K=3 shown) is applied to enforce sparsity. (c) Feature Scaling. The encoder weights of each expert are decomposed and scaled to dynamically amplify high-frequency components.
  • Figure 2: The scaling law for the trained scale factor $\omega$.
  • Figure 3: Performance comparison of Scale SAE against baseline models across three key metrics. (a, b) Reconstruction MSE on the OpenWebText and HLE-Biomedical datasets, respectively. (c) Loss Recovered on the HLE-Biomedical dataset. (d) Automated Interpretability Score on the OpenWebText2 dataset.
  • Figure 4: Performance comparison across two key metrics and distinct data domains, plotted as a function of the number of activated experts. (a, b) Reconstruction MSE was evaluated on the general-domain OpenWebText and the specialized HLE-Biomedical datasets, respectively. (c, d) Loss Recovered was evaluated on the same two datasets.
  • Figure 5: The impact of Feature Scaling across different model settings. (a) Effect of Feature Scaling as a function of the number of activated experts, shown for fixed sparsity levels ($L_0\in\{2,32\}$). (b) Effect of Feature Scaling as a function of target sparsity, shown for fixed expert setups ($e\in\{8,16\}$).
  • ...and 6 more figures
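
The Figure 1(c) caption says the encoder weights of each expert are decomposed and scaled to amplify high-frequency components, with a trained scale factor $\omega$ (Figure 2). The captions do not give the exact decomposition, so the sketch below assumes one common choice: the low-frequency part is the mean encoder direction shared by all features of an expert, and the high-frequency part is each feature's deviation from that mean. The function name `scale_features` and the `(1 + omega)` form are assumptions for illustration only.

```python
import numpy as np

def scale_features(enc_w, omega):
    """Amplify the high-frequency part of one expert's encoder weights.

    enc_w: (n_features, d) encoder weights, one row per feature.
    omega: scale factor (a scalar here; assumed to be learned in practice).

    Assumed decomposition: low = mean row shared by all features,
    high = per-feature deviation from that mean.
    """
    low = enc_w.mean(axis=0, keepdims=True)   # shared (low-frequency) component
    high = enc_w - low                        # feature-specific (high-frequency) component
    return low + (1.0 + omega) * high

# Usage: omega = 0 leaves the weights unchanged; larger omega pushes
# feature directions further from their common mean.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6)) + 5.0             # rows share a large common offset
w_scaled = scale_features(w, omega=1.0)
```

Under this assumed decomposition the shared mean is untouched and only the deviations grow, which is one way such a scaling could push features within an expert apart.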