
Exploring the evolution of research topics during the COVID-19 pandemic

Francesco Invernici, Anna Bernasconi, Stefano Ceri

TL;DR

The paper examines how research topics evolved during the COVID-19 pandemic by building CORToViz, a full-stack pipeline that ingests the CORD-19 corpus and extracts temporally aware topics. It combines large language model embeddings (SPECTER) with UMAP and HDBSCAN, integrated into BERTopic, to cluster abstracts, and it generates human-readable topic representations as TF-IDF word clouds. Temporal dynamics are captured as relative-frequency time series with bins of 1 to 4 weeks; the significance of differences between time windows is assessed with the Kruskal-Wallis test, and an interactive Streamlit dashboard supports exploration. The work demonstrates the approach on CORD-19 and on a climate-change corpus to showcase its generality, emphasizing fast, one-click analytics, interpretable topic trends, and applicability to other textual repositories. The methodology lets researchers and stakeholders quickly see how scientific focus shifted over the course of the pandemic, and it can be adapted to other domains.
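The relative-frequency time series mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a topic's intensity in a bin is the share of that bin's documents assigned to the topic, and the function name `topic_intensity` and the `(publish_date, topic_id)` input shape are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def topic_intensity(docs, topic, start, weeks=2):
    """Map each bin start date to the share of that bin's documents in `topic`.

    docs  : iterable of (publish_date, topic_id) pairs (hypothetical input shape)
    start : first day of the first bin
    weeks : bin width; the dashboard offers 1 to 4 weeks
    """
    if weeks not in (1, 2, 3, 4):
        raise ValueError("bin width must be between 1 and 4 weeks")
    width = timedelta(weeks=weeks)
    totals, hits = Counter(), Counter()
    for published, topic_id in docs:
        if published < start:
            continue  # skip documents published before the first bin
        # Integer division of timedeltas gives the zero-based bin index.
        bin_start = start + ((published - start) // width) * width
        totals[bin_start] += 1
        if topic_id == topic:
            hits[bin_start] += 1
    # Bins that contain no documents at all are simply absent from the result.
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Toy data: eight documents spread over two 2-week bins.
docs = [
    (date(2020, 3, 2), 7), (date(2020, 3, 5), 7),
    (date(2020, 3, 10), 1), (date(2020, 3, 14), 3),
    (date(2020, 3, 17), 7), (date(2020, 3, 20), 1),
    (date(2020, 3, 25), 1), (date(2020, 3, 29), 2),
]
series = topic_intensity(docs, topic=7, start=date(2020, 3, 2), weeks=2)
```

Normalizing by the number of documents in each bin, rather than plotting raw counts, keeps a topic's curve comparable across periods in which the overall publication volume differs.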

Abstract

The COVID-19 pandemic has changed the research agendas of most scientific communities, resulting in an overwhelming production of research articles in a variety of domains, including medicine, virology, epidemiology, economy, psychology, and so on. Several open-access corpora and literature hubs were established; among them, the COVID-19 Open Research Dataset (CORD-19) has systematically gathered scientific contributions for 2.5 years, by collecting and indexing over one million articles. Here, we present the CORD-19 Topic Visualizer (CORToViz), a method and associated visualization tool for inspecting the CORD-19 textual corpus of scientific abstracts. Our method is based upon a careful selection of up-to-date technologies (including large language models), resulting in an architecture for clustering articles along orthogonal dimensions and extraction techniques for temporal topic mining. Topic inspection is supported by an interactive dashboard, providing fast, one-click visualization of topic contents as word clouds and topic trends as time series, equipped with easy-to-drive statistical testing for analyzing the significance of topic emergence along arbitrarily selected time windows. The processes of data preparation and results visualization are completely general and virtually applicable to any corpus of textual documents - thus suited for effective adaptation to other contexts.

Paper Structure (3 sections, 1 equation, 6 figures, 1 table)

This paper contains 3 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Visualizations of the exploratory analysis of CORD-19 data and metadata. (A) Monthly number of publications in CORD-19. The number increases in the first months of 2020 and then remains rather stable until April 2022, when it starts decreasing; CORD-19 was updated until June 2022. The spikes in light color correspond to publications with only the year in their metadata, whose dates were converted to the first of January; these entries were removed. (B) Data-density display of eight metadata fields for a sample of 20% of the dataset. We retain articles with abstract and publish_time metadata. (C) Distribution of the number of duplicates. The majority of articles, on the left of the distribution, have a single duplicate (typically without the DOI), representing a non-peer-reviewed preprint uploaded to public archives before publication; only a few documents appear in the dataset with a high number of replicated entries. (D) Silhouette score of k-Means for different values of k, the number of clusters. A spike in the line plot marks a good candidate for the number of clusters; the figure clearly indicates five, which was then selected for the exploratory clustering analysis.
  • Figure 2: Topic clustering produced by the preliminary and fine-grained clustering methods. We show and compare the clusters introduced in the Results section. (A) Scatter plot of the exploratory clustering analysis. The analysis was performed with k-Means, a classic clustering algorithm. We found five macro-topics and assessed their content with word clouds. As shown in the figure, the five clusters identify distinct classes of topics, well described by word clouds, which nicely partition the set of articles in CORD-19. (B) Dendrograms of the hierarchical density-based clustering. We then explored topics using a technology-rich pipeline, resulting in a fine-grained topic clustering. The high-level cluster hierarchy, with only 29 clusters, resembles the five-macro-topic structure of the preliminary clustering. The full hierarchy includes 354 fine-grained clusters, each related to a specific high-level cluster. We show the hierarchy of the n-glycans-related topics, of the ACE2-related topics, and of an epidemiology-related topic.
  • Figure 3: General architecture of CORToViz. The data pipeline consists of three stages: data preparation (red), hyperparameter optimization (yellow), and topic extraction using the BERTopic model (green); its output provides the ingredients for the dashboard application, a user-friendly interface for topic selection and display. In the data pipeline, the data preparation step selects the abstracts with the appropriate metadata from CORD-19; the hyperparameter optimization finds the values that maximize the performance of the models operating on embeddings; finally, the data transformation generates the artifacts used by the CORToViz dashboard application. The dashboard supports keyword-based topic search and then visualizes the time series information for each topic; each topic is associated with a word cloud, providing insight into the topic's content.
  • Figure 4: User interface of the CORToViz dashboard. (A) Keyword-based search bar; the user enters the example query "ventilator". (B) Six top-ranked topics, explained through their word clouds. The user selects two topics, "ventilator" and "prone-positioning". (C) Line plot of the intensities (i.e., the relative frequencies of appearance) of the two selected topics. The user sets (above) the bin resolution to 2 weeks (options are 1-4 weeks). Histograms (below) show the count of articles associated with the selected topics, with the given bin size of two weeks. (D) Panel showing statistical testing. The user selects the "ventilator" topic and sets two time windows: a six-month window at the beginning of the pandemic and a six-month window at the end of the second year of the pandemic. The tool, at the bottom, reports the result of the Kruskal-Wallis test for the difference between groups. Specifically, it shows the H statistic of this non-parametric test (12.89752), which determines the p-value (0.00033). Since the p-value is below the 5% threshold, the null hypothesis (i.e., no difference between groups) can be rejected, and a green check indicates a statistically significant difference between the observations in the two intervals.
  • Figure 5: Visualizations of relevant example cases. Each panel corresponds to a keyword search on CORToViz (see the plot titles). For each one, we show the word clouds generated for two topics and the line plots of the topics' time series. (A) Variant: a topic on (sub)variants, including omicron, whose spike anticipates a peak in active COVID cases shown in the background; a topic on delta, which increases when the variant spreads worldwide. (B) Vaccine: a generic topic showing an increase in interest over time; a more specific immunization topic with a similar trend. (C) Outbreak: an epidemic-related topic that attracts interest at the beginning but much less after the first months; influenza, a topic with an early peak representing the large fraction of articles written on influenza prior to COVID, then less relevant and almost unrelated to COVID cases. (D) Olfactory and long covid: the former peaks at the beginning of the pandemic and then decreases; the latter shows a steadily increasing trend. (E) Pneumonia: the first topic decreases in mid-2020, while the second topic, highlighting other co-morbidities, grows in interest. (F) Telemedicine and contact tracing: the first attracts steady interest, whereas contact tracing attracts the most interest in the first months and then loses it (as it proved hard to deploy in practice).
  • ...and 1 more figures
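
The statistical test described in panel D of Figure 4 can be illustrated with a small, dependency-free sketch. It is not the dashboard's code: it assumes the usual chi-square approximation for the Kruskal-Wallis H statistic, which for two groups has one degree of freedom, and the function names and the toy intensity values are hypothetical. Under that assumption, the reported H of 12.89752 maps to a p-value of about 0.00033, in line with the figure.

```python
import math


def chi2_sf_1df(h):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(h / 2))


def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H test for two groups; returns (H, p_value)."""
    groups = (list(a), list(b))
    pooled = sorted(groups[0] + groups[1])
    n = len(pooled)
    rank_of, tie_term = {}, 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the 1-based ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t**3 - t
        i = j
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    h /= correction  # standard correction for ties
    return h, chi2_sf_1df(h)


# Hypothetical topic intensities in two time windows (not data from the paper).
early = [0.010, 0.012, 0.011, 0.014, 0.013]
late = [0.020, 0.022, 0.019, 0.025, 0.021]
h, p = kruskal_wallis_two_groups(early, late)
significant = p < 0.05  # the 5% threshold used by the dashboard

# Consistency check against the values reported in Figure 4 (D).
p_reported = chi2_sf_1df(12.89752)
```

Because the test works on ranks rather than raw values, it does not assume that the topic intensities in each window are normally distributed, which is why a non-parametric test suits these time series.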