Embedded Topic Models Enhanced by Wikification

Takashi Shibuya, Takehito Utsuro

TL;DR

This work tackles word homography in topic modeling by injecting entity knowledge from Wikipedia into neural topic models. By combining wikification-based entity linking with Wikipedia2Vec embeddings, the approach feeds ETM and Dynamic ETM with both word and entity representations, which lets the models disambiguate homographs such as "apple" and "amazon" and makes topics easier to interpret. Empirical results on NYT and AIDA-CoNLL show improved generalization, measured by perplexity, and sensible temporal topic dynamics; qualitative topic-transition visualizations show that entity mentions make topics more interpretable. The method shows potential for more accurate, entity-aware topic analysis of corpora with ambiguous terms, though it depends on high-quality entity linking and is subject to biases in the embeddings.

Abstract

Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not account for homography. In this study, we incorporate Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets: 1) news articles from the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the generalizability of neural topic models. Moreover, we analyze frequent terms in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models capture the time-series development of topics well.
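
To make the homography problem concrete, the following is a minimal sketch of the preprocessing idea only: replacing ambiguous surface forms with entity identifiers before counting terms, so that a word and a same-spelled entity become separate vocabulary items. The `TOY_LINKS` table, the cue-word matching rule, the `ENTITY/...` naming, and the helper names `wikify` and `bag_of_terms` are all hypothetical and chosen for illustration; they stand in for a real wikification system and are not the authors' pipeline.

```python
from collections import Counter

# Hypothetical toy linker: (surface form, context cue) -> entity identifier.
# A real wikifier would resolve mentions against Wikipedia instead.
TOY_LINKS = {
    ("apple", "iphone"): "ENTITY/Apple_Inc.",
    ("amazon", "river"): "ENTITY/Amazon_River",
    ("amazon", "retailer"): "ENTITY/Amazon_(company)",
}


def wikify(tokens):
    """Replace a token with an entity ID when its cue word occurs in the same document."""
    context = set(tokens)
    linked = []
    for tok in tokens:
        entity = next(
            (ent for (surface, cue), ent in TOY_LINKS.items()
             if surface == tok and cue in context),
            None,
        )
        linked.append(entity if entity is not None else tok)
    return linked


def bag_of_terms(text):
    """Count words and entity IDs together, as one mixed vocabulary."""
    return Counter(wikify(text.lower().split()))


doc_company = "apple unveiled a new iphone"
doc_fruit = "she ate an apple"

print(bag_of_terms(doc_company))
print(bag_of_terms(doc_fruit))
```

In this sketch the company mention and the fruit end up as different terms, which is the property an entity-aware topic model relies on. Each term would then be paired with a word or entity embedding, a step not shown here.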

Paper Structure (21 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Processing flows of conventional topic models and our proposed topic model.
  • Figure 2: Difference between conventional embedded topic models and our proposed topic model.
  • Figure 3: Examples of topic transition. We present the top five most frequent terms in each topic.