Disentangling Similarity and Relatedness in Topic Models

Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

TL;DR

This paper establishes similarity and relatedness as essential axes for topic model evaluation, provides a reliable pipeline for characterising them across model families and corpora, and demonstrates that similarity and relatedness scores predict downstream task performance, depending on task requirements.

Abstract

The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

Paper Structure (53 sections, 6 figures, 17 tables)

Figures (6)

  • Figure 1: Topic model atlas on the Reuters corpus. Each distribution summarizes 10 runs per model; models are ordered by shifted normalized gap rank, and a vertical separator marks the boundary between non-PLM and PLM-augmented models.
  • Figure 2: Topic model atlas across five corpora.
  • Figure 3: Task C regression with corpus fixed effects: similarity score predicts synonym retrieval consistency (Jaccard@10) across six corpora ($R^2 = 0.46$, $p = 0.002$). Parallel lines show per-corpus intercepts with a shared slope.
  • Figure 4: UMAP projections of synthetic and corpus vocabularies in GloVe embedding space. Green: intersection; blue: synthetic-only; red: corpus-only. Across corpora, the two-lobe geometry is best interpreted as a frequency/domain tendency rather than a hard partition: the left lobe is enriched for common high-frequency words, whereas the right lobe is enriched for rarer, more domain-specific terms. General-domain corpora show dense overlap, while ACL and Reuters exhibit larger corpus-only tails.
  • Figure 5: Training curves for the neural scorer (v2_nn_crossdomain_glove): train loss (EMA-smoothed, shown by training step) with epoch-end validation/test losses, and test-set Spearman correlation over epochs. For readability, the loss-axis is clipped to suppress the initial training-loss spike.
  • ...and 1 more figure
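
The Figure 3 caption describes a regression with corpus fixed effects: one intercept per corpus and a single shared slope, with Jaccard@10 as the response. The sketch below illustrates both ideas in plain Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `jaccard_at_k` and `fit_fixed_effects` are hypothetical, the data are made-up toy values, and the paper's exact definition of retrieval consistency is not given in this section, so Jaccard@k is shown here simply as the overlap between the top-k items of two ranked lists.

```python
from collections import defaultdict
from statistics import fmean


def jaccard_at_k(ranked_a, ranked_b, k=10):
    """Jaccard overlap between the top-k items of two ranked lists."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


def fit_fixed_effects(xs, ys, groups):
    """Least-squares fit of y on x with one intercept per group and a shared slope.

    Uses the within transformation: centre x and y on their group means,
    estimate the slope from the centred data, then recover each intercept.
    This gives the same estimates as OLS with one dummy variable per group.
    """
    by_group = defaultdict(list)
    for x, y, g in zip(xs, ys, groups):
        by_group[g].append((x, y))

    # Per-group means of x and y.
    means = {
        g: (fmean([x for x, _ in pts]), fmean([y for _, y in pts]))
        for g, pts in by_group.items()
    }

    sxy = sum((x - means[g][0]) * (y - means[g][1]) for x, y, g in zip(xs, ys, groups))
    sxx = sum((x - means[g][0]) ** 2 for x, g in zip(xs, groups))
    if sxx == 0:
        raise ValueError("x has no within-group variation; slope is undefined")

    slope = sxy / sxx
    intercepts = {g: mean_y - slope * mean_x for g, (mean_x, mean_y) in means.items()}
    return slope, intercepts


# Toy data (assumed for illustration only, not results from the paper):
# x = similarity score of a model, y = Jaccard@10, group = corpus name.
similarity = [0.2, 0.4, 0.6, 0.2, 0.4, 0.6]
jaccard = [0.2, 0.3, 0.4, 0.4, 0.5, 0.6]
corpus = ["A", "A", "A", "B", "B", "B"]

slope, intercepts = fit_fixed_effects(similarity, jaccard, corpus)
print(f"shared slope: {slope:.3f}")
for name, value in sorted(intercepts.items()):
    print(f"intercept for corpus {name}: {value:.3f}")
```

Plotting one line per corpus with these intercepts and the common slope gives the parallel-line picture the caption describes. Allowing a separate slope per corpus would be a different model and would need many more points per corpus to estimate reliably.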