An Information Retrieval and Extraction Tool for Covid-19 Related Papers

Marcos V. L. Pivetta

TL;DR

The paper addresses the surge of COVID-19 literature by proposing the CORD-19 Knowledge Tool (CORD-19 KTool), which integrates CODA-19-based research-aspect classification, LDA topic modeling, and SciSpacy-based NER with UMLS linking to enable fine-grained retrieval and entity extraction. It demonstrates a pipeline that extracts abstracts from CORD-19, segments text by research aspect, builds per-aspect topic models, and retrieves papers via KNN in topic space, plus regex-based search and interactive visualizations. The results show feasible topic-based organization and search capabilities, with explicit recognition of limitations in classifier accuracy, topic-label interpretability, and dataset scope, suggesting that combining multiple algorithms can improve exploration of COVID-19 literature. The work highlights a practical path toward automated, multi-criteria navigation of large biomedical corpora, potentially accelerating hypothesis generation and reference discovery during fast-moving health crises.

Abstract

Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations in developing countermeasures to the problem have led to a surge of new studies in scientific journals. Objective: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and highlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need for new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.
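
The ranking step named in the Method (K-Nearest Neighbors over topic representations, plus regular-expression search) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper identifiers, abstracts, and three-topic vectors below are hand-written toy values standing in for LDA output, and the Euclidean distance and the function names nearest_papers and regex_search are assumptions made for illustration only.

```python
import math
import re

# Hypothetical toy corpus. In the described pipeline the "topics" vectors would
# come from an LDA model; here they are hand-written placeholders.
PAPERS = {
    "paper_a": {
        "abstract": "Median hospital stay of admitted patients was 12 days.",
        "topics": [0.8, 0.1, 0.1],
    },
    "paper_b": {
        "abstract": "Screening of compounds targeting ACE2 and the spike protein.",
        "topics": [0.1, 0.8, 0.1],
    },
    "paper_c": {
        "abstract": "Hospital admission rates and ACE2 expression in older patients.",
        "topics": [0.5, 0.4, 0.1],
    },
}


def euclidean(a, b):
    """Euclidean distance between two equal-length topic vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_papers(query_topics, papers, k=2):
    """Return the ids of the k papers whose topic vectors are closest to the query."""
    ranked = sorted(
        papers.items(),
        key=lambda item: euclidean(query_topics, item[1]["topics"]),
    )
    return [paper_id for paper_id, _ in ranked[:k]]


def regex_search(pattern, papers):
    """Return the ids of papers whose abstract matches the pattern, case-insensitively."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [pid for pid, paper in papers.items() if compiled.search(paper["abstract"])]


if __name__ == "__main__":
    # A query text would first be mapped into topic space; this vector is assumed.
    print(nearest_papers([0.7, 0.2, 0.1], PAPERS, k=2))
    print(regex_search(r"\bace2\b", PAPERS))
```

In this sketch the two searches are independent: the nearest-neighbor lookup ranks papers by topical similarity to a query, while the regular expression filters on literal text, so the two could be combined by intersecting their results.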

Paper Structure

This paper contains 35 sections, 2 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Diagram describing the tool's use cases and interaction with CORD-19 abstracts and the topic models generated with LDA
  • Figure 2: List of topics addressed in the finding/contribution section of COVID-19 abstracts
  • Figure 3: List of COVID-19 abstracts whose finding/contribution addresses the topic "patients, day, median, hospital, iqr, admission, range, died and years", ordered by date of publication
  • Figure 4: List of COVID-19 abstracts whose finding/contribution addresses the topic "human, protein, screening, ace2 expression, compounds, spike, active, proteins and screened", ordered by score
  • Figure 5: List of recommended papers similar to query text
  • ...and 3 more figures