MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking

Nishant Mishra, Wilker Aziz, Iacer Calixto

TL;DR

MedPath tackles semantic fragmentation, explainability, and semantically-blind evaluation in biomedical entity linking by introducing a large, multi-domain EL dataset. It harmonizes nine expert-annotated corpora, normalizes all entities to UMLS CUIs, and provides cross-vocabulary mappings to up to 62 vocabularies along with full hierarchical paths in up to 11 vocabularies. The work also introduces hierarchy-aware evaluation metrics and reports initial retrieval, reranking, and evaluation results that show substantial benefits from using a cross-domain, semantically enriched benchmark. This resource enables training of more interpretable and interoperable clinical NLP models and supports broader evaluation of biomedical EL across diverse vocabularies and knowledge graphs.

Abstract

Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths -- i.e., from general to specific -- in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.
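
To make the three annotation layers concrete, the following is a minimal Python sketch of how one linked entity could be represented, together with one possible hierarchy-aware score. The record layout, the field names, the `shared_prefix_score` function, and the path labels are illustrative assumptions; they are not MedPath's released schema or the metrics defined in the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LinkedEntity:
    """Hypothetical record layout; all field names are assumptions."""
    mention: str
    cui: str  # UMLS Concept Unique Identifier the mention is normalized to
    # vocabulary name -> code of the same concept in that vocabulary
    mappings: dict[str, str] = field(default_factory=dict)
    # vocabulary name -> path of nodes ordered from general to specific
    paths: dict[str, list[str]] = field(default_factory=dict)


def shared_prefix_score(gold_path: list[str], pred_path: list[str]) -> float:
    """Length of the common general-side prefix divided by the longer path.

    Illustrative only: an exact match scores 1.0, a prediction in an
    unrelated branch scores 0.0, and a near miss gets partial credit.
    """
    if not gold_path or not pred_path:
        return 0.0
    shared = 0
    for gold_node, pred_node in zip(gold_path, pred_path):
        if gold_node != pred_node:
            break
        shared += 1
    return shared / max(len(gold_path), len(pred_path))


# Placeholder node labels, not taken from any real vocabulary.
gold = LinkedEntity(
    mention="drowsiness",
    cui="C0013144",
    paths={"VOCAB_A": ["root", "finding", "drowsiness"]},
)
predicted_path = ["root", "finding", "fatigue"]
score = shared_prefix_score(gold.paths["VOCAB_A"], predicted_path)
```

An exact-match metric would score this prediction as a plain error, whereas a path-based score credits the two shared ancestor nodes. This is the kind of distinction that semantically-blind metrics cannot make.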

Paper Structure

This paper contains 68 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: MedPath creation process. For illustration purposes, we show one example from two different datasets, and vocabulary mappings and path annotations for only one of the concepts, e.g., C0013144 Drowsiness (situation).
  • Figure 2: Semantic type distribution in MedPath.
  • Figure 3: Vocabulary overlap heat map. Datasets' annotations using UMLS are not shown.
  • Figure 4: Histogram of lengths of entity hierarchical paths across different vocabularies.
  • Figure 5: EL performance in the three data settings.
  • ...and 12 more figures