Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle
TL;DR
PRISM addresses polysemanticity in LLM feature descriptions with a multi-concept framework: it samples high-activation text, clusters the text embeddings to find recurring patterns, and uses an LLM to label each cluster. It formalizes two complementary evaluation metrics: a polysemanticity score, which measures the diversity among a feature's labels, and a description score, which uses AUROC and MAD to compare concept samples against control samples. Together, these metrics enable robust, automated assessment of descriptions. Across multiple models and feature types, PRISM (especially the max version) produces more faithful, multi-faceted descriptions than existing methods and reveals more diverse concept spaces, and human studies support the alignment between automated and human judgments. By capturing multiple activation patterns per feature, the framework improves interpretability, gives deeper insight into model representations, and offers a scalable path toward universal concept discovery across architectures.
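The description-scoring idea can be illustrated with a minimal sketch, assuming we already have a feature's activations on text that matches a candidate description (concept samples) and on unrelated text (control samples). AUROC is computed here as the probability that a randomly chosen concept sample activates the feature more strongly than a randomly chosen control sample. The MAD formula used below (difference of mean activations, divided by the standard deviation of the control activations) is an assumption made for illustration, and the function names `auroc` and `mad` are hypothetical; the paper's own definitions take precedence.

```python
from statistics import fmean, pstdev


def auroc(concept_acts, control_acts):
    """Probability that a concept sample out-activates a control sample.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of
    AUROC, which is O(n * m) and intended for small illustrative inputs.
    """
    if not concept_acts or not control_acts:
        raise ValueError("both activation lists must be non-empty")
    wins = 0.0
    for a in concept_acts:
        for b in control_acts:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(concept_acts) * len(control_acts))


def mad(concept_acts, control_acts):
    """Assumed MAD: mean activation gap, scaled by the control spread."""
    spread = pstdev(control_acts)  # population standard deviation
    if spread == 0:
        raise ValueError("control activations have zero variance")
    return (fmean(concept_acts) - fmean(control_acts)) / spread


if __name__ == "__main__":
    # Toy activations, invented for this example only.
    concept = [2.0, 3.0, 4.0]
    control = [0.0, 1.0, 2.0]
    print(f"AUROC = {auroc(concept, control):.3f}")
    print(f"MAD   = {mad(concept, control):.3f}")
```

Under these assumptions, an AUROC near 0.5 would indicate that the description does not separate concept text from control text, while values near 1 indicate that the feature responds selectively to the described concept. The two numbers are complementary: AUROC looks only at the ordering of activations, whereas the assumed MAD reflects the size of the gap.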
Abstract
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated interpretability methods in NLP, PRISM produces more nuanced descriptions that account for both monosemantic and polysemantic behavior. We apply PRISM to LLMs and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
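As a rough illustration of what a polysemanticity score could measure, the sketch below treats each of a feature's descriptions as an embedding vector and reports the mean pairwise cosine distance between them. This is an assumed proxy for "diversity among labels", not the paper's definition: the function names `cosine_similarity` and `label_diversity` and the toy two-dimensional vectors are hypothetical, and in practice the embeddings would come from a sentence-embedding model.

```python
import math
from itertools import combinations
from statistics import fmean


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_u * norm_v)


def label_diversity(embeddings):
    """Mean pairwise cosine distance among description embeddings.

    Returns 0.0 when there are fewer than two descriptions, since a single
    description cannot be diverse with respect to itself.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return fmean(1.0 - cosine_similarity(u, v) for u, v in pairs)


if __name__ == "__main__":
    # Toy embeddings, invented for this example only.
    monosemantic = [[1.0, 0.0], [2.0, 0.0]]              # same direction
    polysemantic = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # two directions
    print(label_diversity(monosemantic))
    print(label_diversity(polysemantic))
```

In this toy setup, descriptions pointing in the same direction give a diversity of 0, and descriptions spread over unrelated directions give a higher value, which matches the intuition that a polysemantic feature needs several distinct descriptions.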
