The Knowledge Microscope: Features as Better Analytical Lenses than Neurons

Yuheng Chen, Pengfei Cao, Kang Liu, Jun Zhao

TL;DR

The paper addresses polysemanticity in neuron-based analyses of factual knowledge in language models and proposes decomposing neurons into features via Sparse Autoencoders (SAE) as alternative analytical units. It shows that features, particularly post-MLP residual features, have a greater impact on knowledge expression ($\Delta Prob$) and better interpretability ($IS$) than neurons, and that they exhibit stronger monosemanticity. A feature-based knowledge erasure method, FeatureEdit, outperforms neuron-based approaches on the PrivacyParaRel dataset across Rel, Gen, Loc, and $\Delta$PPL. The experiments use Gemma Scope SAEs on Gemma-2 models, and the results have practical implications for privacy-preserving model editing; the authors advocate a shift to feature-based mechanistic interpretability in LMs.

Abstract

Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be available.
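
To make the idea of decomposing activations into features and then ablating a feature more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names (`encode`, `decode`, `ablate`), the tiny hand-written weights, and the threshold gate in the encoder are all assumptions chosen for illustration. The abstract only states that an SAE maps activations to features that can serve as analytical units.

```python
def encode(x, W_enc, b_enc, theta):
    """Map an activation vector x to sparse feature activations.

    Assumption: a threshold gate keeps a pre-activation only if it
    exceeds its per-feature threshold; otherwise the feature is 0.
    """
    feats = []
    for w_row, b, t in zip(W_enc, b_enc, theta):
        pre = sum(w * xi for w, xi in zip(w_row, x)) + b
        feats.append(pre if pre > t else 0.0)
    return feats


def decode(f, W_dec, b_dec):
    """Reconstruct the activation vector from feature activations."""
    return [
        b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(len(f)))
        for j in range(len(b_dec))
    ]


def ablate(f, indices):
    """Zero out the selected features, leaving the others untouched."""
    return [0.0 if i in indices else v for i, v in enumerate(f)]


# Hypothetical toy weights: 3 features over a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
theta = [0.5, 0.5, 0.5]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

x = [1.0, 0.2]
features = encode(x, W_enc, b_enc, theta)   # only features 0 and 2 fire
x_hat = decode(features, W_dec, b_dec)      # reconstruction from all features
x_hat_ablated = decode(ablate(features, {2}), W_dec, b_dec)  # feature 2 removed
```

In a real experiment, the ablated reconstruction would be written back into the model's forward pass and the change in the probability of the correct answer would be measured. That step depends on the model and hooking library and is left out here.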

Paper Structure

This paper contains 53 sections, 28 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Comparison of research units for factual knowledge mechanisms in LMs: (a) neurons and (b) features. Colors in neurons (or features) correspond to the facts they store, illustrating how specific facts are encoded in particular units.
  • Figure 2: Evaluation of features obtained by different methods. Top: $\Delta Prob$ after feature ablation. Bottom: Interpretation scores ($IS$). Higher values indicate better performance in both metrics.
  • Figure 3: Distribution plots of activated features under different feature number settings ($n\times 9216, n=1,2,4,8$) for Gemma-2 2B. The similar distribution patterns across different $n$ suggest that features consistently fall into similar regions. Note that the four plots are not identical, only very similar.
  • Figure 4: The impact on $\Delta Prob$ when ablating features from different transformer components and neurons. Values show mean $\pm$ standard error across 5 bootstrap iterations, with higher values indicating greater influence on knowledge expression. Note that while $\Delta Prob \in [0,1]$, the plotted values may exceed 1 because the upper error bar is added to the mean.
  • Figure 5: The impact on $\Delta Prob$ when ablating features from different transformer components and neurons.
  • ...and 6 more figures
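
The Figure 4 caption reports mean $\pm$ standard error across 5 bootstrap iterations. A minimal sketch of one way such a summary could be computed is given below. The resampling scheme (resampling the per-fact $\Delta Prob$ values with replacement and taking the standard deviation of the resampled means as the standard error), the function name `bootstrap_mean_se`, and the example values are all assumptions; the caption does not specify the exact procedure.

```python
import random
import statistics


def bootstrap_mean_se(values, n_iterations=5, seed=0):
    """Summarize values as (mean, standard error) over bootstrap resamples.

    Assumption: each iteration resamples `values` with replacement and
    records the resample mean; the standard error is taken to be the
    sample standard deviation of those means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_iterations):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.fmean(resample))
    return statistics.fmean(means), statistics.stdev(means)


# Hypothetical per-fact Delta Prob values, each within [0, 1].
delta_probs = [0.95, 1.0, 0.9, 1.0, 0.85, 1.0, 0.98, 0.92]
mean, se = bootstrap_mean_se(delta_probs)
upper_bar = mean + se  # can exceed 1 even though every value is at most 1
```

This also illustrates the caption's remark: when the mean is close to 1, adding the error bar can push the plotted upper bound above 1 even though each individual value stays within $[0,1]$.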