Network-based Topic Structure Visualization
Yeseul Jeon, Jina Park, Ick Hoon Jin, Dongjun Chung
TL;DR
The paper addresses the difficulty of analyzing inter-correlated topics by applying a Gaussian latent space item response model (LSIRM) to a topic-words matrix derived from Biterm Topic Model outputs. It estimates latent topic positions and applies Procrustes alignment and an oblique rotation to produce an intuitive two-dimensional topic visualization, while a composite word score $s_{i,j}$ combines topic probabilities and latent distances to select interpretable representative words. Applied to COVID-19 literature from PubMed, the method reveals three coherent topic clusters (outbreak/impact, symptoms/treatment, and molecular-body processes) and shows that topic interpretations are robust across word sets of different sizes. The approach offers a practical, interpretable framework for exploring topic networks without relying on ad hoc similarity measures, and the software and data are publicly available for replication.
Abstract
In the real world, many topics are inter-correlated, making it challenging to investigate their structure and relationships. Understanding the interplay between topics and their relevance can provide valuable insights for researchers, guiding their studies and informing the direction of research. In this paper, we use the topic-words distribution obtained from topic models as item-response data and model the structure of topics with a latent space item response model. By estimating the latent positions of topics based on their distances to words, we capture the underlying topic structure and reveal the relationships among topics. Visualizing the latent positions of topics in Euclidean space allows for an intuitive understanding of their proximity and associations. We interpret relationships among topics by characterizing each topic with representative words selected using a newly proposed scoring scheme. Additionally, we assess the maturity of topics by tracking their latent positions across different word sets, which provides insight into the robustness of topics. To demonstrate the effectiveness of our approach, we analyze the topic composition of COVID-19 studies during the early stage of its emergence using biomedical literature in the PubMed database. The software and data used in this paper are publicly available at https://github.com/jeon9677/gViz.
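Tracking latent positions across word sets requires the estimated configurations to be comparable, because distances in a latent space are unchanged by rotation, reflection, and translation. The following is a minimal sketch of orthogonal Procrustes alignment under stated assumptions, not the authors' implementation: the function name `procrustes_align`, the five synthetic topic positions, and the rotation angle are all hypothetical and chosen only for illustration.

```python
import numpy as np


def procrustes_align(reference, target):
    """Rotate/reflect and translate `target` to best match `reference`.

    Both inputs are (n_topics, n_dims) arrays of latent positions.
    Hypothetical helper for illustration; not taken from the gViz software.
    """
    ref_mean = reference.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    ref_c = reference - ref_mean  # centered reference configuration
    tgt_c = target - tgt_mean     # centered target configuration

    # The orthogonal matrix R minimizing ||tgt_c @ R - ref_c||_F is U @ Vt,
    # where U, S, Vt is the SVD of tgt_c.T @ ref_c.
    u, _, vt = np.linalg.svd(tgt_c.T @ ref_c)
    rotation = u @ vt

    # Map the target into the reference coordinate system.
    return tgt_c @ rotation + ref_mean


# Synthetic example (assumed data): five topic positions in two dimensions.
rng = np.random.default_rng(0)
positions_a = rng.normal(size=(5, 2))

# Pretend a second fit returned the same configuration, rotated and shifted.
angle = np.pi / 3
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
positions_b = positions_a @ rot + np.array([1.0, -2.0])

aligned_b = procrustes_align(positions_a, positions_b)
max_shift = np.abs(aligned_b - positions_a).max()
print("largest coordinate difference after alignment:", max_shift)
```

After alignment, any remaining difference between two configurations reflects a genuine change in relative topic positions and not an arbitrary change of coordinates, which is what makes positions from different word sets comparable.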
