Exploring the evolution of research topics during the COVID-19 pandemic
Francesco Invernici, Anna Bernasconi, Stefano Ceri
TL;DR
The paper addresses how research topics evolved during the COVID-19 pandemic by building CORToViz, a full-stack pipeline that ingests the CORD-19 corpus and extracts temporally aware topics. It combines large language model embeddings (SPECTER) with UMAP and HDBSCAN, integrated through BERTopic, to cluster abstracts, and it renders human-readable topic representations as TF-IDF word clouds. Temporal dynamics are captured as relative-frequency time series over 1–4 week bins; the Kruskal-Wallis test assesses statistical significance, and an interactive Streamlit dashboard supports exploration. The work demonstrates the approach on CORD-19 and on a climate-change corpus to show its generality, emphasizing fast, one-click analytics and interpretable topic trends. The methodology lets researchers and stakeholders quickly grasp how scientific focus shifted over the course of the pandemic, and it can be adapted to other textual repositories and domains.
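To make the notion of a relative-frequency time series concrete, the following is a minimal sketch using only the Python standard library. It assumes each document has already been assigned a publication date and a topic label by some upstream clustering step; the function name `relative_frequency_series`, the input layout, and the choice to start the first bin at the earliest publication date are illustrative assumptions, not details taken from CORToViz.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def relative_frequency_series(docs, bin_weeks=1):
    """Build per-topic relative-frequency time series.

    docs: iterable of (publication_date, topic_id) pairs (hypothetical layout).
    bin_weeks: bin width in weeks (the TL;DR mentions 1 to 4).
    Returns {topic_id: [(bin_start, share), ...]}, where share is the
    fraction of that bin's documents assigned to the topic.
    """
    docs = list(docs)
    if not docs:
        return {}
    # Assumption: bins start at the earliest publication date.
    origin = min(pub_date for pub_date, _ in docs)
    width = timedelta(weeks=bin_weeks)

    totals = Counter()                 # documents per bin, all topics
    per_topic = defaultdict(Counter)   # documents per bin, per topic
    for pub_date, topic in docs:
        idx = (pub_date - origin) // width  # integer bin index
        totals[idx] += 1
        per_topic[topic][idx] += 1

    n_bins = max(totals) + 1
    series = {}
    for topic, counts in per_topic.items():
        series[topic] = [
            (origin + i * width, counts[i] / totals[i] if totals[i] else 0.0)
            for i in range(n_bins)
        ]
    return series


if __name__ == "__main__":
    toy_docs = [
        (date(2020, 3, 2), "vaccines"),
        (date(2020, 3, 4), "masks"),
        (date(2020, 3, 10), "vaccines"),
        (date(2020, 3, 12), "vaccines"),
        (date(2020, 3, 20), "masks"),
    ]
    for topic, points in relative_frequency_series(toy_docs).items():
        print(topic, [(start.isoformat(), round(share, 2)) for start, share in points])
```

Normalizing by the number of documents in each bin, rather than plotting raw counts, keeps a topic's trend from simply mirroring the overall growth of the corpus.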
Abstract
The COVID-19 pandemic has changed the research agendas of most scientific communities, resulting in an overwhelming production of research articles in a variety of domains, including medicine, virology, epidemiology, economics, and psychology. Several open-access corpora and literature hubs were established; among them, the COVID-19 Open Research Dataset (CORD-19) has systematically gathered scientific contributions for 2.5 years, collecting and indexing over one million articles. Here, we present the CORD-19 Topic Visualizer (CORToViz), a method and associated visualization tool for inspecting the CORD-19 textual corpus of scientific abstracts. Our method is based upon a careful selection of up-to-date technologies (including large language models), resulting in an architecture for clustering articles along orthogonal dimensions and extraction techniques for temporal topic mining. Topic inspection is supported by an interactive dashboard, providing fast, one-click visualization of topic contents as word clouds and topic trends as time series, equipped with easy-to-drive statistical testing for analyzing the significance of topic emergence along arbitrarily selected time windows. The processes of data preparation and results visualization are completely general and applicable to virtually any corpus of textual documents, and are thus suited for effective adaptation to other contexts.
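As an illustration of statistical testing over selected time windows, the sketch below computes the Kruskal-Wallis H statistic for a topic's relative frequencies in two windows. It is a standard-library approximation under stated assumptions: exactly two windows are compared, so the p-value comes from a chi-square distribution with one degree of freedom, and the helper names `average_ranks` and `kruskal_two_windows` are hypothetical. In practice, `scipy.stats.kruskal` provides the same test for any number of groups; how the dashboard itself invokes the test is not specified here.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks.

    Returns (ranks, tie_sizes), where tie_sizes lists the size of each
    group of equal values (needed for the tie correction).
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_two_windows(window_a, window_b):
    """Kruskal-Wallis H test for two samples; returns (H, p_value)."""
    n_a, n_b = len(window_a), len(window_b)
    n = n_a + n_b
    ranks, tie_sizes = average_ranks(list(window_a) + list(window_b))
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])

    h = 12 / (n * (n + 1)) * (rank_sum_a**2 / n_a + rank_sum_b**2 / n_b) - 3 * (n + 1)
    correction = 1 - sum(t**3 - t for t in tie_sizes) / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error

    # Survival function of chi-square with 1 degree of freedom (two groups only).
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value


if __name__ == "__main__":
    # Toy weekly shares of one topic in two user-selected windows.
    early = [0.02, 0.03, 0.025, 0.04]
    late = [0.10, 0.12, 0.09, 0.11]
    h, p = kruskal_two_windows(early, late)
    print(f"H = {h:.3f}, p = {p:.4f}")
```

Because the test is rank-based, it makes no normality assumption about the weekly shares, which suits short and skewed frequency series.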
