Exploring Topic Trends in COVID-19 Research Literature using Non-Negative Matrix Factorization
Divya Patel, Vansh Parikh, Om Patel, Agam Shah, Bhaskar Chaudhury
TL;DR
This study maps the thematic structure and temporal evolution of COVID-19 research by applying Non-Negative Matrix Factorization (NMF) to a large CORD-19 full-text corpus after rigorous data cleaning, n-gram merging, and tf-idf feature selection. The document-term matrix is factorized as $X \approx WH$, where $W$ is the document-topic matrix and $H$ is the topic-word matrix, so that both the topics and their distribution across documents can be interpreted. A stability analysis based on Average Jaccard similarity and Hungarian matching selects a 20-topic model. Terms within each topic are ranked by the relevance score $R(w|t) = \lambda p(w|t) + (1-\lambda) \frac{p(w|t)}{p(w)}$ with $\lambda = 0.5$. Temporal topic trends are obtained by averaging the normalized topic mixtures from $W$ by month. They show rising themes in vaccines, online education, mental health, and telemedicine, alongside declines in certain imaging, testing, and transmission topics, illustrating the model’s utility for tracking knowledge evolution and informing research priorities.
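As a rough illustration of the relevance score $R(w|t)$ defined above, the following minimal Python sketch ranks the terms of a single topic. The function name `relevance_ranking`, the toy vocabulary, and all probability values are hypothetical and chosen only for illustration; how $p(w|t)$ and $p(w)$ are estimated from $H$ and the corpus is not specified here and is left as an assumption.

```python
def relevance_ranking(p_w_given_t, p_w, lam=0.5, top_n=3):
    """Rank the terms of one topic by
    R(w|t) = lam * p(w|t) + (1 - lam) * p(w|t) / p(w)."""
    scores = {
        w: lam * p_wt + (1.0 - lam) * p_wt / p_w[w]
        for w, p_wt in p_w_given_t.items()
    }
    # Highest relevance first.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical corpus-wide term probabilities p(w) (assumed values).
p_w = {"patient": 0.40, "vaccine": 0.05, "dose": 0.02, "study": 0.53}
# Hypothetical term probabilities p(w|t) within one topic (assumed values).
p_w_given_t = {"patient": 0.30, "vaccine": 0.35, "dose": 0.15, "study": 0.20}

top_terms = relevance_ranking(p_w_given_t, p_w, lam=0.5)
by_probability = sorted(p_w_given_t, key=p_w_given_t.get, reverse=True)[:3]

print(top_terms)       # ranked by relevance
print(by_probability)  # ranked by p(w|t) alone
```

In this toy setting, the ratio term $p(w|t)/p(w)$ promotes "dose", which is rare in the corpus but concentrated in the topic, above generic terms such as "patient" and "study". Setting `lam=1.0` recovers the plain ranking by $p(w|t)$.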
Abstract
In this work, we apply topic modeling using Non-Negative Matrix Factorization (NMF) to the COVID-19 Open Research Dataset (CORD-19) to uncover the underlying thematic structure of the extensive body of COVID-19 research literature and its evolution over time. NMF factorizes the document-term matrix into two non-negative matrices that represent the topics and their distribution across the documents. These matrices show how strongly each document relates to each topic and how strongly each topic relates to each word. We describe the complete methodology, which involves a series of rigorous pre-processing steps that standardize the available text data while preserving the context of phrases, followed by feature extraction using term frequency-inverse document frequency (tf-idf), which weights words by their frequency within a document and their rarity across the dataset. To ensure the robustness of our topic model, we conduct a stability analysis that computes stability scores of the NMF topic model for different numbers of topics, enabling us to select the optimal number of topics. Using the selected model, we track the evolution of topics over time within the CORD-19 dataset. Our findings contribute to the understanding of the knowledge structure of the COVID-19 research landscape, providing a valuable resource for future research in this field.
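The stability analysis mentioned above can be sketched as follows, assuming one common formulation: two topic models with the same number of topics are compared by the Average Jaccard similarity of their ranked top-term lists, and the topics are paired one-to-one so that the total similarity is maximal. This sketch is not the authors' implementation. It replaces Hungarian matching with a brute-force search over permutations, which gives the same optimal pairing but is only feasible for a handful of topics; for 20 topics a dedicated assignment solver such as SciPy's `linear_sum_assignment` would be needed. The function names and the toy term lists are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_jaccard(ranked_a, ranked_b):
    """Mean Jaccard similarity of the top-d prefixes for d = 1..depth,
    so agreement near the top of the rankings counts more."""
    depth = min(len(ranked_a), len(ranked_b))
    total = sum(jaccard(ranked_a[:d], ranked_b[:d]) for d in range(1, depth + 1))
    return total / depth


def agreement(topics_a, topics_b):
    """Mean similarity under the best one-to-one topic pairing.
    Brute force over permutations (assumption: small, equal k)."""
    k = len(topics_a)
    sim = [[average_jaccard(ta, tb) for tb in topics_b] for ta in topics_a]
    best = max(
        sum(sim[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return best / k


# Hypothetical top-term lists from two NMF runs with k = 2 topics.
model_a = [["vaccine", "dose", "immunity"], ["school", "online", "student"]]
model_b = [["online", "school", "student"], ["vaccine", "dose", "trial"]]

score = agreement(model_a, model_b)
print(round(score, 4))
```

Under this formulation, a stability score for a given number of topics could be obtained by averaging `agreement` between a reference model and models fitted on resampled subsets of the corpus, and the number of topics with the highest score would be selected.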
