Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang

TL;DR

SAEs expose millions of sparse directions, far too many to visualize at once. This work proposes a focused, topology-based exploration framework that uses curated concept sets and Ball Mapper embeddings, implemented in the SAE Semantic Explorer, to preserve local and global relationships across model layers. The approach complements traditional projection methods with topology-aware representations, enabling targeted, interpretable analysis of concept representations in latent space across 26 residual layers and 65k features from Gemma Scope, with explanations from Neuronpedia. Key contributions include a scalable, interactive multi-view tool, demonstrations on human-curated (THINGS) and discipline-based (Subjects) concept sets, and insights into how concept structure evolves across layers. The work advances practical interpretability by enabling concept-level interventions and model editing workflows, while acknowledging limitations in auto-generated explanations and pointing to future enhancements in visualization and explanation pipelines.

Abstract

Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.

Paper Structure

This paper contains 7 sections, 3 figures.

Figures (3)

  • Figure 1: SAE Semantic Explorer interface. A. Data view. Left: SAE features, a concept set (words with assigned categories), and a cosine-similarity threshold for retrieving relevant features. Right: bar chart showing the number of discovered concepts per layer. B. Category view. For the selected layer (23), each row displays a category’s feature count and its overlap with the pinned category food, facilitating comparison with animal. C. UMAP view. Retrieved features from the selected layer, with food and animal categories highlighted. D. Ball Mapper view. Topological graph showing the structural relationships among food and animal features. E. Feature view. Interactive panel displaying details of selected features via click or lasso selection. F. Concept query. Search interface for locating specific concepts.
  • Figure 2: Ball mapper construction example. Left: Original point cloud. Middle: For a given radius $\epsilon$, a subset of points (red) is selected as ball centers such that the resulting balls (1–7, blue) cover the entire dataset. Right: The resulting ball mapper graph represents each ball as a node, with an edge between two nodes if their corresponding balls share data points.
  • Figure 3: A. Top: a mapper node containing features 1–3, all related to the concept music album; bottom: querying fox across layers consistently yields Fox News, with wolf as its nearest neighbor. B. UMAP and ball mapper views for layers 0 and 25, highlighting features associated with food and electronic devices. C. A ball mapper path of subject concepts illustrating a transition from Mathematics to Computer Science.
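
The ball mapper construction described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes Euclidean distance and a greedy, order-dependent choice of ball centers, neither of which the caption specifies. The function name `ball_mapper` and the toy point cloud are hypothetical.

```python
from itertools import combinations

import numpy as np


def ball_mapper(points, eps):
    """Build a ball mapper graph from a point cloud (illustrative sketch).

    Assumptions not taken from the paper: Euclidean distance and a greedy
    scan for centers (a point becomes a center if no existing center lies
    within eps of it).
    Returns (centers, balls, edges), where centers are point indices, balls
    are sets of point indices, and edges are pairs of ball indices.
    """
    points = np.asarray(points, dtype=float)

    # Step 1: pick centers so that every point is within eps of some center.
    centers = []
    for i, p in enumerate(points):
        if not centers:
            centers.append(i)
            continue
        nearest = np.min(np.linalg.norm(points[centers] - p, axis=1))
        if nearest > eps:
            centers.append(i)

    # Step 2: each ball collects all points within eps of its center.
    dists = np.linalg.norm(
        points[:, None, :] - points[centers][None, :, :], axis=2
    )
    balls = [
        set(np.flatnonzero(dists[:, j] <= eps).tolist())
        for j in range(len(centers))
    ]

    # Step 3: connect two balls (graph nodes) if they share a data point.
    edges = [
        (a, b)
        for a, b in combinations(range(len(balls)), 2)
        if balls[a] & balls[b]
    ]
    return centers, balls, edges


# Toy example: three nearby points on a line plus one distant outlier.
toy_points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
centers, balls, edges = ball_mapper(toy_points, eps=1.0)
print(centers, balls, edges)
```

In the toy example, the first two balls overlap on the point at `[1.0, 0.0]` and are therefore joined by an edge, while the outlier forms an isolated node. A larger `eps` yields fewer, coarser nodes, and a smaller `eps` yields a finer graph.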