Disentangling Similarity and Relatedness in Topic Models

Hanlin Xiao; Mauricio A. Álvarez; Rainer Breitling

TL;DR

This paper establishes similarity and relatedness as essential axes for topic model evaluation, provides a reliable pipeline for characterising them across model families and corpora, and demonstrates that similarity and relatedness scores predict downstream task performance depending on task requirements.

Abstract

The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

Paper Structure (53 sections, 6 figures, 17 tables)

Introduction
Related Work
Topic Models and Evaluation
Similarity vs. Relatedness
Method
Similarity–Relatedness Scorer
Training Data Construction
Neural Network Architecture and Training
Topic Model Atlas
Topic Model Taxonomy
Corpora
Atlas Construction and Consistency Analysis
Downstream Tasks
Task A: Event Monitoring (Reuters)
Task B: Category Retrieval (Reuters)
...and 38 more sections

Figures (6)

Figure 1: Topic model atlas on the Reuters corpus. Each distribution summarizes 10 runs per model; models are ordered by shifted normalized gap rank, and a vertical separator marks the boundary between non-PLM and PLM-augmented models.
Figure 2: Topic model atlas across five corpora.
Figure 3: Task C regression with corpus fixed effects: similarity score predicts synonym retrieval consistency (Jaccard@10) across six corpora ($R^2 = 0.46$, $p = 0.002$). Parallel lines show per-corpus intercepts with a shared slope.
Figure 4: UMAP projections of synthetic and corpus vocabularies in GloVe embedding space. Green: intersection; blue: synthetic-only; red: corpus-only. Across corpora, the two-lobe geometry is best interpreted as a frequency/domain tendency rather than a hard partition: the left lobe is enriched for common high-frequency words, whereas the right lobe is enriched for rarer, more domain-specific terms. General-domain corpora show dense overlap, while ACL and Reuters exhibit larger corpus-only tails.
Figure 5: Training curves for the neural scorer (v2_nn_crossdomain_glove): training loss (EMA-smoothed, shown by training step) with epoch-end validation and test losses, and test-set Spearman correlation over epochs. For readability, the loss axis is clipped to suppress the initial training-loss spike.
...and 1 more figure
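
The Figure 3 caption describes a regression with corpus fixed effects, in which each corpus has its own intercept and all corpora share one slope, with Jaccard@10 as the response. The following is a minimal sketch of that kind of analysis and not the authors' code. It assumes that Jaccard@10 means the Jaccard overlap between the top-10 items of two ranked lists, and the names `jaccard_at_k` and `fit_fixed_effects` are hypothetical helpers introduced only for illustration. The data in the usage lines are made up and carry no relation to the paper's results.

```python
import numpy as np


def jaccard_at_k(ranked_a, ranked_b, k=10):
    """Jaccard overlap between the top-k items of two ranked lists.

    Assumption: this is one plausible reading of "Jaccard@10";
    the paper's exact definition may differ.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


def fit_fixed_effects(x, y, groups):
    """Least-squares fit of y on x with one intercept per group and a shared slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = sorted(set(groups))
    # One indicator column per group (no global intercept), then the x column.
    dummies = np.array(
        [[1.0 if g == lab else 0.0 for lab in labels] for g in groups]
    )
    design = np.column_stack([dummies, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    intercepts = dict(zip(labels, coef[:-1]))
    slope = float(coef[-1])
    return intercepts, slope


# Toy usage with synthetic numbers (not from the paper).
overlap = jaccard_at_k(["a", "b", "c"], ["b", "c", "d"], k=3)
toy_groups = ["corpus_1"] * 3 + ["corpus_2"] * 3
toy_x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
toy_y = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
toy_intercepts, toy_slope = fit_fixed_effects(toy_x, toy_y, toy_groups)
```

Coding the corpus as indicator columns without a global intercept yields the parallel-lines picture directly: each coefficient on an indicator is that corpus's intercept, and the final coefficient is the slope shared by all corpora.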
