Navigating the Concept Space of Language Models

Wilson E. Marcílio-Jr, Danilo M. Eler

Abstract

Sparse autoencoders (SAEs) trained on large language model activations produce thousands of features that can be mapped to human-interpretable concepts. Current practice for analyzing these features relies primarily on inspecting top-activating examples, manually browsing individual features, or running semantic search for concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

Paper Structure (10 sections, 3 equations, 4 figures)


Figures (4)

  • Figure 1: Concept Explorer user interface. The right panel shows HUMAP projections at two levels; the left panel shows explanation content, top-k activation contexts and annotation controls for a selected feature. The middle panel shows the projection level being explored.
  • Figure 2: HUMAP projection of explanation embeddings for SAE features from SmolLM2 (1.5M contexts, top-16 per feature). Colors indicate analyst-assigned coarse categories.
  • Figure 3: (a) Punctuation-related features clustered by explanation embeddings; (b) a rare feature focused on dashes and list markers. Table \ref{tab:feature-ids-explanations} lists the relevant explanations for each feature ID.
  • Figure 4: Marker and glyph concepts discovered by Concept Explorer. Reprojection of a level-1 region of influence to level 0 reveals fine-grained substructure.