Navigating the Concept Space of Language Models

Wilson E. Marcílio-Jr; Danilo M. Eler

Abstract

Sparse autoencoders (SAEs) trained on large language model activations produce thousands of features that can be mapped to human-interpretable concepts. Current practice for analyzing these features relies primarily on inspecting top-activating examples, manually browsing individual features, or running semantic search for concepts of interest, which makes exploratory concept discovery difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.
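
The coarse-to-fine navigation described above can be illustrated with a minimal sketch. It is not the HUMAP procedure named in the paper outline: greedy farthest-point sampling and nearest-landmark assignment are used here as simple stand-ins for landmark selection and regions of influence, and the embedding values and function names are hypothetical.

```python
import math


def farthest_point_landmarks(points, k):
    """Pick k landmark indices by greedy farthest-point sampling.

    Assumption: this is a stand-in for illustration only, not the
    landmark selection rule used by HUMAP.
    """
    landmarks = [0]
    # Distance from every point to its closest landmark so far.
    dist = [math.dist(p, points[0]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        landmarks.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return landmarks


def regions_of_influence(points, landmarks):
    """Map each landmark index to the indices of the points nearest to it."""
    regions = {l: [] for l in landmarks}
    for i, p in enumerate(points):
        nearest = min(landmarks, key=lambda l: math.dist(p, points[l]))
        regions[nearest].append(i)
    return regions


# Hypothetical 2-D explanation embeddings forming two well-separated groups.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.2, 5.0), (5.0, 5.1),
]

# Coarse level: two landmarks summarize the whole space.
coarse = farthest_point_landmarks(embeddings, k=2)
regions = regions_of_influence(embeddings, coarse)

# Drill down: recover the fine-grained points behind the first landmark.
fine = [embeddings[i] for i in regions[coarse[0]]]
```

Selecting a landmark at the coarse level and expanding its region corresponds, loosely, to moving from a cluster overview to the neighborhood of individual features.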

Paper Structure (10 sections, 3 equations, 4 figures)

Introduction
Background
Sparse Autoencoders for Large Language Model Representations
HUMAP for hierarchical neighborhood embedding
Methods
Concept Explorer
Use Case - Exploring Concepts from SmolLM2
Punctuation cluster
Conclusion
Appendix

Figures (4)

Figure 1: Concept Explorer user interface. The right panel shows HUMAP projections at two levels; the left panel shows explanation content, top-k activation contexts and annotation controls for a selected feature. The middle panel shows the projection level being explored.
Figure 2: HUMAP projection of explanation embeddings for SAE features from SmolLM2 (1.5M contexts, top-16 per feature). Colors indicate analyst-assigned coarse categories.
Figure 3: (a) Punctuation-related features clustered by explanation embeddings; (b) a rare feature focused on dashes and list markers. Table \ref{tab:feature-ids-explanations} shows relevant explanations for feature IDs.
Figure 4: Marker and glyph concepts discovered by Concept Explorer. Reprojection of a level-1 region of influence to level 0 reveals fine-grained substructure.
