Embedded Topic Models Enhanced by Wikification
Takashi Shibuya, Takehito Utsuro
TL;DR
This work tackles word homography in topic modeling by injecting entity knowledge from Wikipedia into neural topic models. By combining wikification-based entity linking with Wikipedia2Vec embeddings, the approach feeds the Embedded Topic Model (ETM) and Dynamic ETM with both word and entity representations, which lets the models disambiguate homographs such as "apple" and "amazon" and makes topics easier to interpret. Results on NYT and AIDA-CoNLL show improved generalization, measured by perplexity, and sensible temporal topic dynamics; topic-transition visualizations show that entity mentions make the topics more interpretable. The method points toward more accurate, entity-aware topic analysis of corpora with ambiguous terms, though it depends on the quality of the entity linking and inherits any biases in the embeddings.
Abstract
Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take the homography of words into account. In this study, we incorporate Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets: 1) news articles from the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the generalizability of neural topic models. Moreover, we analyze frequent terms in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.
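To illustrate the homography problem, here is a minimal Python sketch of how entity linking could change a bag-of-words input. It is an illustration under stated assumptions, not the authors' pipeline: the function `wikify_tokens`, the `ENTITY/` prefix, and the hand-written link spans are all hypothetical stand-ins for the output of a real wikification tool.

```python
from collections import Counter

def wikify_tokens(tokens, links):
    """Replace each linked mention span with one entity token.

    tokens: list of word strings.
    links:  list of (start, end, entity_title) spans, end exclusive and
            non-overlapping. In practice these would come from an entity
            linker; here they are written by hand (an assumption).
    """
    by_start = {start: (end, title) for start, end, title in links}
    out = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, title = by_start[i]
            # The "ENTITY/" prefix is an illustrative naming convention.
            out.append("ENTITY/" + title)
            i = end
        else:
            out.append(tokens[i].lower())
            i += 1
    return out

# Two documents that share the surface form "apple".
doc_a = "Apple unveiled a new phone".split()
doc_b = "She ate an apple".split()

# Hypothetical linker output: only the first mention is a named entity.
links_a = [(0, 1, "Apple_Inc.")]

bow_a = Counter(wikify_tokens(doc_a, links_a))
bow_b = Counter(wikify_tokens(doc_b, []))
```

After this step the company and the fruit occupy different vocabulary entries, so a topic model that looks up one embedding per vocabulary item can assign them different vectors.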
