Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang
TL;DR
SAEs expose millions of sparse directions, far too many to visualize at once. This work proposes a focused, topology-based exploration framework that uses curated concept sets and Ball Mapper embeddings, implemented in the SAE Semantic Explorer, to preserve local and global relationships across model layers. The approach complements traditional projection methods with topology-aware representations, enabling targeted, interpretable analysis of concept representations in latent space across 26 residual layers and 65k features from Gemma Scope, with explanations from Neuronpedia. Key contributions include a scalable, interactive multi-view tool, demonstrations on human-curated (THINGS) and discipline-based (Subjects) concept sets, and insights into how concept structure evolves across layers. The work advances practical interpretability by supporting concept-level interventions and model editing workflows, while acknowledging limitations in auto-generated explanations and pointing to future enhancements in the visualization and explanation pipelines.
Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for uncovering interpretable features in large language models (LLMs) through the sparse directions they learn. However, the sheer number of extracted directions makes comprehensive exploration intractable. While conventional embedding techniques such as UMAP can reveal global structure, they suffer from limitations including high-dimensional compression artifacts, overplotting, and misleading neighborhood distortions. In this work, we propose a focused exploration framework that prioritizes curated concepts and their corresponding SAE features over attempts to visualize all available features simultaneously. We present an interactive visualization system that combines topology-based visual encoding with dimensionality reduction to faithfully represent both local and global relationships among selected features. This hybrid approach enables users to investigate SAE behavior through targeted, interpretable subsets, facilitating deeper and more nuanced analysis of concept representation in latent space.
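To make the topology-based encoding more concrete, the following is a minimal sketch of the generic Ball Mapper construction named in the TL;DR, not the SAE Semantic Explorer's actual implementation. It assumes the selected SAE features are available as rows of a NumPy array and that a Euclidean radius `eps` is chosen by the user; the function name `ball_mapper` and its return format are hypothetical and serve illustration only.

```python
import numpy as np


def ball_mapper(points, eps):
    """Generic Ball Mapper sketch (illustrative, not the paper's code).

    points: array-like of shape (n_points, n_dims), e.g. feature vectors.
    eps:    ball radius (an assumed, user-chosen parameter).
    Returns (landmarks, balls, edges).
    """
    points = np.asarray(points, dtype=float)

    # Step 1: greedy epsilon-net. A point becomes a landmark only if it is
    # farther than eps from every landmark chosen so far.
    landmarks = []
    for i, p in enumerate(points):
        if not landmarks:
            landmarks.append(i)
            continue
        dist_to_landmarks = np.linalg.norm(points[landmarks] - p, axis=1)
        if np.min(dist_to_landmarks) > eps:
            landmarks.append(i)

    # Step 2: each landmark defines a ball holding all points within eps.
    centers = points[landmarks]
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    covered = dists <= eps  # shape (n_points, n_landmarks)
    balls = [set(np.flatnonzero(covered[:, j]).tolist())
             for j in range(len(landmarks))]

    # Step 3: connect two balls whenever they share at least one point.
    edges = {(a, b)
             for a in range(len(balls))
             for b in range(a + 1, len(balls))
             if balls[a] & balls[b]}
    return landmarks, balls, edges


if __name__ == "__main__":
    # Toy data: three nearby points on a line plus one distant outlier.
    toy = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0]]
    landmarks, balls, edges = ball_mapper(toy, eps=0.6)
    print(landmarks, balls, edges)
```

In the resulting graph, overlapping balls indicate features that are close in the original space, while isolated nodes mark features with no neighbors within `eps`. Because the graph is built from distances in the original space, it is not subject to the distortions a 2D projection can introduce.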
