Evaluating Sparse Autoencoders for Monosemantic Representation

Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A. B. Siddique

TL;DR

This work addresses polysemanticity in LLMs by evaluating sparse autoencoders (SAEs) as a route to monosemantic representations. It introduces a distribution-aware concept separability score based on the Jensen–Shannon distance and demonstrates that SAEs increase separability and reduce polysemanticity across two large language models and five datasets. The paper also proposes APP, a posterior-probability-based attenuation method, and shows that distribution-aware, partial interventions enable more precise concept removal with minimal language-model degradation, especially when combined with SAE representations. Validated across both models, the results indicate that SAEs enhance the interpretability and controllability of concept-level behavior, with practical implications for safe and transparent AI systems.

Abstract

A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through the lens of activation distributions. We introduce a fine-grained concept separability score based on the Jensen–Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (covering both word-level and sentence-level concepts), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
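
To make the idea of a distribution-aware separability measure concrete, the following is a minimal sketch of the Jensen–Shannon distance between one neuron's activation histograms under two concepts. It is not the paper's exact score: the binning scheme, the helper names (`histogram`, `js_distance`), and the toy activations are assumptions made for illustration, and how the paper aggregates distances across concept pairs is not stated here.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram over `bins` equal-width bins on [lo, hi] (assumed binning)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so values at or beyond the edges fall into the first/last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2, range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))

# Hypothetical activations of one neuron on inputs from two concepts.
acts_concept_a = [0.0, 0.1, 0.2, 0.1]
acts_concept_b = [0.9, 1.0, 0.8, 0.95]

p_a = histogram(acts_concept_a, bins=10, lo=0.0, hi=1.0)
p_b = histogram(acts_concept_b, bins=10, lo=0.0, hi=1.0)

separated = js_distance(p_a, p_b)   # distributions with no shared bins
identical = js_distance(p_a, p_a)   # fully overlapping distributions
print(separated, identical)
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no overlap), so a neuron whose activations separate concepts cleanly scores near 1 while a polysemantic neuron with overlapping distributions scores near 0.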

Paper Structure

This paper contains 25 sections, 12 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: SAEs reduce neuron overlap in comparison to the base model, indicating lower polysemanticity. Higher-capacity SAEs (65k) further reduce overlap, suggesting more effective assignment of distinct neurons to separate concepts.
  • Figure 2: Across base model and SAEs (SAE-16k, SAE-65k), neurons exhibit varying degrees of separability in their activations. Some have completely overlapping activations across concepts, others show partial or clear separation. This variability underscores the importance of using distribution-aware metrics when assessing monosemanticity.
  • Figure 3: Separability Score vs. Erasure Ability (Partial)
  • Figure 4: Separability Score vs. Erasure Ability (Full)
  • Figure 5: Across base model and SAEs (SAE-16k, SAE-65k), neurons exhibit varying degrees of separability in their activations. Some have completely overlapping activations across concepts, others show partial or clear separation. This variability underscores the importance of using distribution-aware metrics when assessing neuron monosemanticity.
  • ...and 2 more figures
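
Figures 3 and 4 relate separability to erasure ability under partial and full interventions. The sketch below contrasts the two strategies for a single neuron. The exact APP rule is not given in this summary, so everything here is an assumption: concept-conditioned activations are modeled as Gaussians, the posterior comes from Bayes' rule with equal priors over a target concept and everything else, and the names `concept_posterior`, `full_mask`, and `attenuate` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def concept_posterior(x, target, other, prior=0.5):
    """P(target concept | activation x) via Bayes' rule (two-class assumption)."""
    like_target = gaussian_pdf(x, *target) * prior
    like_other = gaussian_pdf(x, *other) * (1.0 - prior)
    denom = like_target + like_other
    return like_target / denom if denom > 0 else 0.0

def full_mask(x):
    """Full neuron masking: the activation is zeroed regardless of its value."""
    return 0.0

def attenuate(x, target, other):
    """Partial suppression: scale the activation down by the target posterior."""
    return x * (1.0 - concept_posterior(x, target, other))

# Hypothetical (mean, std) of one neuron's activations per condition.
target_stats = (4.0, 1.0)   # inputs containing the concept to remove
other_stats = (0.0, 1.0)    # all other inputs

for x in (0.5, 2.0, 4.0):
    print(x, full_mask(x), attenuate(x, target_stats, other_stats))
```

Under these assumptions, an activation typical of the target concept is suppressed almost entirely, while an activation typical of other inputs passes through nearly unchanged. Full masking discards both, which is one plausible reason a partial, distribution-aware intervention could degrade the language model less.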