Bidirectional Topic Matching: Quantifying Thematic Overlap Between Corpora Through Topic Modelling

Raven Adam, Marie Lisa Kogler

TL;DR

Bidirectional Topic Matching (BTM) introduces a cross-corpus analysis framework that trains separate topic models for each corpus and applies them reciprocally to quantify thematic overlap and divergence. By computing cross-topic assignments, co-occurrence-based pairings, and a set of new corpus-level measures ($C$, $C_w$, $U$, $U_w$, $A$, $A_w$), BTM captures both shared themes and corpus-specific topics while handling outliers via model-specific topic spaces. Validation against cosine similarity and Cohen’s kappa demonstrates robust agreement and highlights areas where outliers reveal methodological strengths. A climate-news case study from German-language corpora shows substantial overlap alongside meaningful topic-specific distinctions, illustrating BTM’s practical utility for interdisciplinary discourse analysis and its potential extensions to multilingual and temporal datasets.

Abstract

This study introduces Bidirectional Topic Matching (BTM), a novel method for cross-corpus topic modeling that quantifies thematic overlap and divergence between corpora. BTM is a flexible framework that can incorporate various topic modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA). BTM employs a dual-model approach, training separate topic models for each corpus and applying them reciprocally to enable comprehensive cross-corpus comparisons. This methodology facilitates the identification of shared themes and unique topics, providing nuanced insights into thematic relationships. Validation against cosine similarity-based methods demonstrates the robustness of BTM, with strong agreement metrics and distinct advantages in handling outlier topics. A case study on climate news articles showcases BTM's utility, revealing significant thematic overlaps and distinctions between corpora focused on climate change and climate action. BTM's flexibility and precision make it a valuable tool for diverse applications, from political discourse analysis to interdisciplinary studies. By integrating shared and unique topic analyses, BTM offers a comprehensive framework for exploring thematic relationships, with potential extensions to multilingual and dynamic datasets. This work highlights BTM's methodological contributions and its capacity to advance discourse analysis across various domains.

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Schematic Outline of Bidirectional Topic Matching Procedures for Calculating the Thematic Closeness Factor of Corpus 1 and Corpus 2. Optional additional analysis of topic similarity may be conducted via cosine similarity.
  • Figure 2: Left side – The largest native topic from corpus 1 along with the five most prominent cross-topic pairs from corpus 2. The gray area indicates the pairing strength for each pair. Right side – The largest native topic from corpus 2 along with the five most prominent cross-topic pairs from corpus 1. The gray area indicates the pairing strength for each pair.
  • Figure 3: The pairing strength composition for the 25 largest native topics from corpus 1. The shading of the bars indicates the ranking of the topic pairing strengths, where the most prominent pair is represented by the darkest color. Topic pairs with a pairing strength below 0.05 were merged into the “remaining topic” category. The outlier topic pairing strength or topic uniqueness is indicated by the red dashed bars.
  • Figure 4: The pairing strength composition for the 25 largest native topics from corpus 2. The shading of the bars indicates the ranking of the topic pairing strengths, where the most prominent pair is represented by the darkest color. Topic pairs with a pairing strength below 0.05 were merged into the “remaining topic” category. The outlier topic pairing strength or topic uniqueness is indicated by the red dashed bars.
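
The pairing strength compositions described in Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that this excerpt does not confirm: that pairing strength is the share of a native topic's documents that the other corpus's model assigns to each cross topic, that outliers carry the label -1 (the BERTopic convention), and that documents which are outliers in their own native model are skipped. The function name `pairing_strengths`, its parameters, and the toy assignments are hypothetical; only the 0.05 merge threshold is taken from the captions.

```python
from collections import Counter, defaultdict

OUTLIER = -1  # assumed outlier label (BERTopic convention); not confirmed by the source


def pairing_strengths(native_topics, cross_topics, min_strength=0.05):
    """Share of each native topic's documents per cross topic.

    native_topics[i] is the topic document i gets from its own corpus's model;
    cross_topics[i] is the topic the other corpus's model assigns to it.
    Shares below min_strength are merged into "remaining"; the share assigned
    to the cross-model outlier class is reported as "unique".
    """
    counts = defaultdict(Counter)
    for native, cross in zip(native_topics, cross_topics):
        if native == OUTLIER:
            continue  # assumption: native outliers are not analysed as topics
        counts[native][cross] += 1

    result = {}
    for native, counter in counts.items():
        total = sum(counter.values())
        shares = {"unique": 0.0, "remaining": 0.0}
        for cross, n in counter.items():
            share = n / total
            if cross == OUTLIER:
                shares["unique"] = share
            elif share < min_strength:
                shares["remaining"] += share
            else:
                shares[cross] = share
        result[native] = shares
    return result


# Toy assignments for six documents from corpus 1 (hypothetical data).
native = [0, 0, 0, 0, 1, 1]
cross = [2, 2, 3, OUTLIER, 5, 5]
strengths = pairing_strengths(native, cross)
```

In this toy case, native topic 0 splits across cross topics 2 and 3 with a quarter of its documents landing in the outlier class, while native topic 1 maps entirely onto cross topic 5. Under the stated assumptions, a large "unique" share would correspond to the red dashed bars in the figures.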