Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle

TL;DR

PRISM addresses polysemanticity in LLM feature descriptions with a multi-concept framework: it samples high-activation text, clusters sentence embeddings to find recurring patterns, and uses an LLM to label each cluster. It formalizes two complementary evaluation metrics: polysemanticity scoring (which measures diversity among a feature's labels) and description scoring (AUROC and MAD, comparing activations on concept samples against control samples), enabling automated assessment of descriptions. Across multiple models and feature types, PRISM (especially the max variant) produces more faithful, multi-faceted descriptions and reveals diverse concept spaces, with human studies supporting alignment between automated and human judgments. By capturing multiple activation patterns per feature, the framework improves interpretability, gives deeper insight into model representations, and offers a scalable path toward concept discovery across architectures.
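The description scoring mentioned above can be illustrated with a minimal sketch. It assumes that a feature's activations have already been collected on concept samples (texts matching a description) and on control samples. The function names `auroc` and `mad` are hypothetical. MAD is not expanded in this summary, so the sketch assumes it denotes a mean activation difference normalized by the control standard deviation; the paper's exact definition may differ.

```python
import numpy as np

def auroc(concept_acts, control_acts):
    """Probability that a random concept activation exceeds a random
    control activation, counting ties as one half (Mann-Whitney form)."""
    concept = np.asarray(concept_acts, dtype=float)
    control = np.asarray(control_acts, dtype=float)
    # Pairwise differences between every concept and control activation.
    diff = concept[:, None] - control[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def mad(concept_acts, control_acts, eps=1e-12):
    """ASSUMPTION: mean activation difference, scaled by the standard
    deviation of the control activations (eps avoids division by zero)."""
    concept = np.asarray(concept_acts, dtype=float)
    control = np.asarray(control_acts, dtype=float)
    return float((concept.mean() - control.mean()) / (control.std() + eps))

# Toy activations for one feature (illustrative values, not real data).
concept_activations = [2.0, 3.0]
control_activations = [0.0, 1.0]
print(auroc(concept_activations, control_activations))
print(mad(concept_activations, control_activations))
```

Under this reading, an AUROC near 0.5 would mean the description does not separate concept samples from control samples, while a value near 1.0 would mean the feature activates consistently higher on texts matching the description.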

Abstract

Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated interpretability methods in NLP, PRISM produces more nuanced descriptions that account for both monosemantic and polysemantic behavior. We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
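As a rough illustration of a polysemanticity score, the sketch below measures how similar a feature's cluster labels are to one another. This summary only says that the score describes diversity among labels, so the specific choice here (mean pairwise cosine similarity of label embeddings, with lower values meaning more diverse labels) is an assumption, and `label_similarity_score` is a hypothetical name. The embeddings are assumed to come from some sentence encoder and are passed in as plain arrays.

```python
import numpy as np

def label_similarity_score(label_embeddings):
    """Mean pairwise cosine similarity between label embeddings.
    ASSUMPTION: low values suggest diverse (polysemantic) labels,
    high values suggest consistent (monosemantic) labels."""
    emb = np.asarray(label_embeddings, dtype=float)
    if emb.ndim != 2 or len(emb) < 2:
        raise ValueError("need at least two label embeddings")
    # Normalize rows so the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Average only over distinct pairs (strict upper triangle).
    upper = np.triu_indices(len(emb), k=1)
    return float(sim[upper].mean())

# Toy embeddings (illustrative): two identical labels, then two unrelated ones.
print(label_similarity_score([[1.0, 0.0], [2.0, 0.0]]))
print(label_similarity_score([[1.0, 0.0], [0.0, 1.0]]))
```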

Paper Structure

This paper contains 53 sections, 6 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Overview of the PRISM framework. PRISM captures multiple concepts per feature, enabling the detection of both polysemantic and monosemantic features, unlike prior approaches that constrain each feature to a single description. For example, feature 3815 in layer 47 was previously labeled as monosemantic [openai2024automatedinterpretability], whereas PRISM reveals that it responds to multiple distinct concepts. Polysemanticity scoring summarizes how diverse the concepts associated with a feature are, while description scoring assesses how well each concept aligns with the feature's activation distribution.
  • Figure 2: Steps for extracting feature descriptions with PRISM. In Step 1, PRISM processes a text dataset through the model and selects sentences from the top percentile of the activation distribution for a given feature. In Step 2, these high-activation sentences are embedded using a sentence encoder and clustered to identify recurring patterns. In Step 3, the top activating examples from each cluster are used to prompt an LLM, which generates descriptive labels for each cluster.
  • Figure 3: Comparison of PRISM (max) AUROC evaluation scores and PRISM polysemanticity scores across different models and layers.
  • Figure 4: Clustering of identified PRISM feature descriptions in GPT-2 XL. The $k_m=50$ meta-clusters are visualized using UMAP, with metalabels generated by Gemini 1.5 Pro and three randomly selected sample descriptions shown per cluster.
  • Figure 5: Cluster labeling comparison between human and PRISM (LLM). We compare cluster labeling for two features: a polysemantic feature (top row) and a monosemantic feature (bottom row). On the left, we show representative text spans from input samples that strongly activate the feature, grouped into five clusters based on shared patterns. Within each span, tokens with the highest activations are highlighted. On the right, we compare the cluster labels generated by PRISM (LLM-based) and a human annotator shown the same input as the model. Additionally, the human rates the conceptual coherence of the five cluster labels on a scale from 0.0 to 1.0, where lower values indicate more diverse (polysemantic) and higher values more consistent (monosemantic) labeling. This rating is directly compared with PRISM's polysemanticity score for the same feature.
  • ...and 12 more figures
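The three extraction steps described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: activations and sentence embeddings are assumed to be precomputed arrays, a small NumPy k-means stands in for whatever clustering method the paper uses, and the prompt wording plus all function names (`select_top_percentile`, `kmeans`, `build_cluster_prompts`) are hypothetical. No LLM is called; the sketch stops at building the prompts.

```python
import numpy as np

def select_top_percentile(activations, percentile=99.0):
    """Step 1: indices of samples in the top percentile of activations."""
    acts = np.asarray(activations, dtype=float)
    threshold = np.percentile(acts, percentile)
    return np.flatnonzero(acts >= threshold)

def kmeans(points, k, n_iter=50, seed=0):
    """Step 2: plain k-means on sentence embeddings (stand-in clusterer)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = pts[labels == j].mean(axis=0)
    dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_cluster_prompts(sentences, activations, labels, top_n=3):
    """Step 3: one labeling prompt per cluster, built from the top-activating
    sentences of that cluster (prompt wording is hypothetical)."""
    acts = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    prompts = {}
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        top = members[np.argsort(acts[members])[::-1][:top_n]]
        examples = "\n".join(f"- {sentences[i]}" for i in top)
        prompts[int(c)] = "Describe the concept shared by these texts:\n" + examples
    return prompts

# Toy data (illustrative only): six sentences, activations, 2-D embeddings.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
activations = [0.1, 0.2, 5.0, 5.5, 6.0, 6.5]
embeddings = np.array([[0, 0], [0, 1], [0, 0], [0, 0.1], [10, 10], [10, 10.1]])

keep = select_top_percentile(activations, percentile=50.0)
clusters = kmeans(embeddings[keep], k=2)
prompts = build_cluster_prompts(
    [sentences[i] for i in keep], np.asarray(activations)[keep], clusters, top_n=1
)
for cluster_id, prompt in prompts.items():
    print(cluster_id, prompt, sep="\n")
```

Each prompt would then be sent to a labeling LLM, yielding one description per cluster and therefore several candidate concepts per feature.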