Sparse Autoencoders are Topic Models

Leander Girrbach, Zeynep Akata

TL;DR

This paper recasts sparse autoencoders (SAEs) as topic models. It extends Latent Dirichlet Allocation (LDA) to embedding spaces, yielding a continuous-topic model (CTM) in which SAE features act as topic atoms, and derives the SAE objective as a maximum a posteriori (MAP) estimator under that model. The resulting framework, SAE-TM, pretrains foundational SAEs to learn reusable topic directions, interprets those directions as word distributions on downstream data, and merges them into any desired number of topics without retraining. Across five text datasets and three image datasets, SAE-TM produces more coherent topics than the baselines while maintaining diversity. It also supports downstream analyses such as cross-dataset thematic comparisons and a study of how topics change over time in Japanese woodblock prints. The authors position the approach as a scalable, modality-agnostic tool for large-scale thematic analysis and state that code and data will be released upon publication.

Abstract

Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
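Step (3) of the framework, merging SAE features into a chosen number of topics without retraining, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`kmeans`, `merge_features`), the toy vocabulary, the feature embeddings, and the word distributions are all made up for illustration. The sketch also assumes that a topic's word distribution is the renormalized sum of its member features' distributions, which the abstract does not specify.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def merge_features(word_dists, labels, k):
    """Assumed aggregation: sum member distributions, then renormalize."""
    vocab_size = len(word_dists[0])
    topics = []
    for j in range(k):
        summed = [0.0] * vocab_size
        for dist, lab in zip(word_dists, labels):
            if lab == j:
                summed = [s + d for s, d in zip(summed, dist)]
        total = sum(summed)
        topics.append([s / total for s in summed] if total > 0 else summed)
    return topics


# Hypothetical toy data: four SAE features, a four-word vocabulary.
vocab = ["wave", "boat", "kimono", "actor"]
feature_embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
feature_word_dists = [
    [0.7, 0.3, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.2, 0.8],
]

labels = kmeans(feature_embeddings, k=2)
topics = merge_features(feature_word_dists, labels, k=2)
for topic in topics:
    ranked = sorted(zip(vocab, topic), key=lambda wp: -wp[1])
    print([word for word, _ in ranked[:2]])
```

Because the SAE and its per-feature word distributions stay fixed, changing the number of topics only requires rerunning the clustering and the aggregation with a different `k`.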

Paper Structure

This paper contains 16 sections, 19 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Sparse autoencoders are topic models: the connection clarifies the nature of SAEs and enables our SAE Topic Model (SAE-TM) to find more coherent topics than other methods. Score = intruder detection accuracy, avg. over five datasets at 50 topics.
  • Figure 2: Overview of our SAE topic model (SAE-TM): (a) pretrain foundational SAEs on large text or vision datasets to learn transferable atomic directions; (b) interpret relevant SAE features on downstream datasets by associating each feature with a distribution over words; (c) cluster SAE feature embeddings derived from their top associated words via $k$-means and merge clustered features into topics, aggregating their word distributions. Colors indicate modality (green = text, blue = vision) and trainable (orange) vs. frozen (grey) components.
  • Figure 3: Statistics of the 10 topics with the highest variance across four popular image datasets. Values indicate the proportion of images in each dataset where the topic is active (even weakly). Differences between datasets reveal interesting trends, such as a comparatively higher frequency of images of animals and plants in ImageNet than in web-sourced datasets.
  • Figure 4: Statistics of the 10 topics with the highest variance in Japanese woodblock prints from different artistic periods. Changes in topic distribution reflect a changing cultural environment (e.g., clothing) and popular themes (e.g., domestic scenes vs. nature).
  • Figure 5: Qualitative examples showing three Japanese woodblock prints and LLM summaries of the top five topics assigned by our SAE-TM.
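
The statistic reported in Figures 3 and 4 (the share of items in each group where a topic is active, ranked by how much that share varies across groups) can be computed along the following lines. This is a sketch under stated assumptions, not the paper's code: "active (even weakly)" is read here as any strictly positive activation, the variance is the population variance, and the group names and activation values are invented toy data.

```python
from statistics import pvariance


def active_proportion(activations, topic, threshold=0.0):
    """Fraction of items whose activation for `topic` exceeds `threshold`."""
    hits = sum(1 for row in activations if row[topic] > threshold)
    return hits / len(activations)


def top_varying_topics(groups, n_topics, top_n=10):
    """Rank topics by the variance of their active proportion across groups."""
    props = {
        name: [active_proportion(acts, t) for t in range(n_topics)]
        for name, acts in groups.items()
    }
    variance = [
        pvariance([props[name][t] for name in groups])
        for t in range(n_topics)
    ]
    ranked = sorted(range(n_topics), key=lambda t: variance[t], reverse=True)
    return ranked[:top_n], props


# Hypothetical toy data: rows are items, columns are topic activations.
groups = {
    "dataset_a": [[0.9, 0.0, 0.1], [0.2, 0.0, 0.3],
                  [0.0, 0.0, 0.5], [0.4, 0.1, 0.2]],
    "dataset_b": [[0.0, 0.0, 0.4], [0.0, 0.2, 0.1],
                  [0.0, 0.0, 0.3], [0.1, 0.0, 0.6]],
}

ranked, props = top_varying_topics(groups, n_topics=3, top_n=2)
for t in ranked:
    print(t, {name: props[name][t] for name in groups})
```

The same function applies to the woodblock-print analysis if the groups are artistic periods instead of datasets.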