ExpertLens: Activation steering features are highly interpretable

Masha Fedzechkina, Eleonora Gualdoni, Sinead Williamson, Katherine Metcalf, Skyler Seto, Barry-John Theobald

TL;DR

Activation steering features in large language models can be interpretable when examined with ExpertLens. Concepts are defined via positive/negative sentence sets, each neuron's expertise for a concept is scored with average precision ($AP$), and the top neurons are selected using a threshold $\tau$. With this setup, the authors demonstrate that ExpertLens yields stable, human-aligned concept representations across models and datasets. ExpertLens representations align closely with human similarity judgments (often surpassing embeddings) and reveal human-like domain structures, which emerge progressively during training and scale with model size. This lightweight methodology offers practical interpretability and potential avenues for safety alignment and data-centric model debugging. The noted limitations include the focus on single-word concepts and the reliance on specific model families.
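To make the scoring and selection steps concrete, here is a minimal sketch in plain Python. It assumes each neuron's activation has already been pooled to one value per sentence, and that a neuron counts as an expert when its $AP$ is at least $\tau$; the pooling, the direction of the comparison, the function names, and the toy data are illustrative assumptions rather than details from the paper.

```python
from typing import Dict, List, Sequence, Set


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP of ranking sentences by `scores` (descending) against binary `labels`.

    labels[i] is 1 for a sentence from the positive set and 0 for the negative set.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def find_experts(
    activations: Dict[str, List[float]], labels: Sequence[int], tau: float
) -> Set[str]:
    """Return the neurons whose AP for the concept is at least tau (assumed rule)."""
    return {
        neuron
        for neuron, acts in activations.items()
        if average_precision(acts, labels) >= tau
    }


# Hypothetical toy data: two positive and two negative sentences, three neurons.
toy_labels = [1, 1, 0, 0]
toy_activations = {
    "n0": [0.9, 0.8, 0.1, 0.2],  # ranks both positives first
    "n1": [0.1, 0.9, 0.8, 0.2],  # ranks one positive first, one last
    "n2": [0.1, 0.2, 0.9, 0.8],  # ranks both negatives first
}
toy_experts = find_experts(toy_activations, toy_labels, tau=0.5)
```

Raising `tau` shrinks the expert set, which is the trade-off the different $\tau$ settings in the paper's figures explore.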

Abstract

Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., "cat") using the "finding experts" method from research on activation steering and show that ExpertLens, i.e., inspection of these neurons, provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.

Paper Structure

This paper contains 48 sections, 30 figures, 2 tables.

Figures (30)

  • Figure 1: ExpertLens is relatively stable across various dataset characteristics. Points represent condition means; error bars represent bootstrapped $95\%$ confidence intervals. Columns and rows represent the size (number of unique sentences) of the positive and negative sets respectively. Intra-concept is within-concept expert overlap; inter-concept is expert overlap averaged across randomly sampled pairs of concepts. See App. \ref{app:pilot_set_sizes} for corresponding expert set sizes.
  • Figure 2: ExpertLens representations are closely aligned with human ones. Points are Spearman correlations between the expert neuron overlap and perceived human similarity in the MEN dataset (significant after checkpoint 1, $p<0.05$); error bars are bootstrapped $95\%$ confidence intervals. The subplots are labeled with $\tau$.
  • Figure 3: ExpertLens representations are more closely aligned with human ones than the embeddings. Points are Spearman correlations between LLM similarity and human similarity in the MEN dataset; error bars are bootstrapped $95\%$ confidence intervals. Subplots show the similarity type: ExpertLens is Jaccard similarity at the best-performing $\tau$ ($0.5$), significant ($p<0.05$) after checkpoint 1; sentence embeddings are the average last-layer embeddings over the positive set, significant after checkpoint 1; single-word embeddings are from the embeddings layer, significant after checkpoint 4k for the 12b models and after checkpoint 1k for other sizes.
  • Figure 4: ExpertLens representations reconstruct human conceptual structure in Pythia-12b. Each node represents a concept; edge thickness corresponds to Jaccard similarity between concepts in the expert space.
  • Figure 5: Expert set size (log) by model size and checkpoint. Points are averages over all concepts; error bars are bootstrapped $95\%$ confidence intervals. Subplots are different values of $\tau$.
  • ...and 25 more figures
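
The two measures named in these captions, Jaccard similarity between expert sets and Spearman correlation with human similarity ratings, can be sketched in plain Python as follows. The expert sets and human ratings here are hypothetical toy values, not data from the paper or from MEN, and the helper names are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical expert sets (neuron indices) for three concepts.
experts: Dict[str, Set[int]] = {
    "cat": {1, 2, 3, 4},
    "dog": {2, 3, 4, 5},
    "car": {4, 5, 6, 7},
}
# Hypothetical human similarity ratings for the same concept pairs.
human = {("cat", "dog"): 0.9, ("cat", "car"): 0.1, ("dog", "car"): 0.2}

pairs = list(combinations(experts, 2))
model_similarity = [jaccard(experts[a], experts[b]) for a, b in pairs]
human_similarity = [human[pair] for pair in pairs]
rho = spearman(model_similarity, human_similarity)
```

Because Spearman correlation depends only on rank order, it asks whether concept pairs that humans judge more similar also share more expert neurons, without requiring the two scales to match.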