The Geometric Structure of Topic Models

Johannes Hirth, Tom Hanika

TL;DR

This work reframes topic-model interpretation by treating term-topic and document-topic relations as incidence structures and applying Formal Concept Analysis to yield hierarchical concept lattices. It introduces ordinal motifs and a geometric drawing paradigm to reveal high-order relationships and temporal dynamics that classic heatmaps or embeddings miss. Incidence-based reduction techniques (TITANIC and pq-cores) produce readable, interpretable lattices, enabling robust association rules and zoomed-in topic analyses as demonstrated on the SSH21 machine-learning corpus. The approach offers a principled, global view of topic spaces with potential extensions to hierarchical models and user-centered visualizations.

Abstract

Topic models are a popular tool for clustering and analyzing textual data. They allow texts to be classified on the basis of their affiliation to the previously calculated topics. Despite their widespread use in research and application, an in-depth analysis of topic models is still an open research topic. State-of-the-art methods for interpreting topic models are based on simple visualizations, such as similarity matrices, top-term lists, or embeddings, which are limited to a maximum of three dimensions. In this paper, we propose an incidence-geometric method for deriving an ordinal structure from flat topic models, such as non-negative matrix factorization. This structure enables the analysis of the topic model in a higher (order) dimension and the extraction of conceptual relationships between several topics at once. Because it relies on conceptual scaling, our approach does not introduce any artificial topical relationships, such as artifacts of feature compression. Based on our findings, we present a new visualization paradigm for concept hierarchies based on ordinal motifs, which allow for a top-down view of topic spaces. We introduce our approach and demonstrate its applicability on a topic model derived from a corpus of scientific papers taken from 32 top machine learning venues.
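To make the incidence view concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a weighted document-topic matrix is turned into a binary incidence relation by a single threshold (one simple scaling choice; the name `delta` and the toy weights are hypothetical) and then enumerates the formal concepts of the resulting context by brute force.

```python
from itertools import combinations

def threshold_context(weights, delta):
    """Binarize weights: document d is incident with topic t iff weight >= delta."""
    return [frozenset(t for t, w in enumerate(row) if w >= delta) for row in weights]

def extent(context, topics):
    """All documents that are incident with every topic in `topics`."""
    return frozenset(d for d, row in enumerate(context) if topics <= row)

def intent(context, docs, n_topics):
    """All topics shared by every document in `docs`."""
    common = set(range(n_topics))
    for d in docs:
        common &= context[d]
    return frozenset(common)

def formal_concepts(context, n_topics):
    """Naive enumeration: close every topic subset (only viable for few topics)."""
    concepts = set()
    for k in range(n_topics + 1):
        for subset in combinations(range(n_topics), k):
            docs = extent(context, frozenset(subset))
            concepts.add((docs, intent(context, docs, n_topics)))
    return concepts

def density(context, n_topics):
    """Fraction of (document, topic) pairs that are in the incidence relation."""
    return sum(len(row) for row in context) / (len(context) * n_topics)

# Hypothetical toy weights: 4 documents x 3 topics (illustration only).
weights = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.6, 0.3],
    [0.0, 0.1, 0.9],
]
context = threshold_context(weights, delta=0.3)
concepts = formal_concepts(context, n_topics=3)
```

Each concept pairs a maximal set of documents with the maximal set of topics they share; ordering the concepts by extent inclusion gives the concept lattice. The brute-force closure over all topic subsets is exponential in the number of topics, so a realistic model would need a dedicated algorithm and the reduction techniques mentioned in the TL;DR.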

Paper Structure

This paper contains 22 sections, 1 equation, 15 figures, and 2 tables.

Figures (15)

  • Figure 1: The similarity heatmap of the topics from the SSH21 topic model [TopicSpaceTrajectories].
  • Figure 2: Visualizations of the SSH21 topic model from the literature. The similarity heatmap (top) is from Figure 4.1 in Schäfermeier et al. [TopicSpaceTrajectories]; the vector space representation and the heatmap from Figure 2 are both from [MappingResearchTrajectories].
  • Figure 3: The weighted term-topic (left) and document-topic relations (right).
  • Figure 4: The density of the document-topic relation for given thresholds (left) and the concept lattice sizes for given entities and thresholds (right).
  • Figure 5: The concept lattice for the entities B. Schölkopf (top) and W. Nejdl (bottom).
  • ...and 10 more figures

Theorems & Definitions (1)

  • Definition 1: Geometric Structure