
BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation

Samuele Garda, Ulf Leser

TL;DR

BELHD tackles the critical problem of homonym-induced ambiguity in biomedical entity linking by introducing a KB preprocessing step that disambiguates homonyms with entity-specific strings and a novel candidate-sharing training strategy. Built atop BioSyn, BELHD preserves a name-based linking paradigm while leveraging contextual information and cross-mention signals to improve recall@1, achieving state-of-the-art results on six of ten BELB corpora with an average gain of 4.55pp. Importantly, the homonym disambiguation component is modular and improves other name-based methods (e.g., GenBioEL), indicating broad applicability beyond BELHD itself. The approach enhances practical BEL performance in real pipelines and offers a scalable solution for homonym-rich KBs like UMLS and NCBI Gene.

Abstract

Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). Name-based methods are a popular approach to the task: they identify the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, because these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn model (Sung et al., 2020), introducing two crucial extensions. First, it preprocesses the KB, expanding homonyms with an automatically chosen disambiguating string, thus enforcing unique linking decisions. Second, we introduce candidate sharing, a novel strategy to select candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and thus can also improve other methods, which we exemplify for GenBioEL (Yuan et al., 2022), a generative name-based BEL approach. Code is available at: link added upon publication.
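
To make the KB preprocessing idea concrete, the following is a minimal sketch of homonym expansion on a toy KB. It is not the authors' implementation. The entity identifiers (`E1`, `E2`), the function name `disambiguate_homonyms`, and the rule for picking the disambiguating string (the first name that is unique to the entity, falling back to the entity identifier) are all assumptions made for illustration; the abstract only states that the string is chosen automatically.

```python
from collections import defaultdict


def disambiguate_homonyms(kb):
    """Return a copy of `kb` in which every name maps to exactly one entity.

    `kb` maps an entity identifier to its list of names. A name owned by
    more than one entity (a homonym) is expanded with a disambiguating
    string in parentheses.
    """
    # Record which entities own each name.
    owners = defaultdict(set)
    for entity_id, names in kb.items():
        for name in names:
            owners[name].add(entity_id)
    homonyms = {name for name, ids in owners.items() if len(ids) > 1}

    result = {}
    for entity_id, names in kb.items():
        # ASSUMPTION (illustrative rule, not from the paper): use the first
        # name unique to this entity, or the identifier if none exists.
        unique_names = [n for n in names if n not in homonyms]
        tag = unique_names[0] if unique_names else entity_id
        result[entity_id] = [
            f"{n} ({tag})" if n in homonyms else n for n in names
        ]
    return result


# Hypothetical toy KB: two entities share the name "diabetes".
toy_kb = {
    "E1": ["diabetes", "diabetes mellitus"],
    "E2": ["diabetes", "diabetes insipidus"],
}
disambiguated = disambiguate_homonyms(toy_kb)
print(disambiguated)
```

After this step, a name-based model that returns "diabetes (diabetes mellitus)" points to a single entity, which is the property the preprocessing is meant to enforce.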

Paper Structure (21 sections, 2 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Illustration of entity-based (a) and name-based (b) approaches to biomedical entity linking. Underlined text highlights the KB homonym (see the section on homonyms) that prevents a unique linking decision in (b). In (c) we show how BELHD addresses the issue by replacing homonyms with their disambiguated versions. Text in blue and red represents the correct and wrong predictions, respectively.
  • Figure 2: Illustration of our Homonym Disambiguation approach for biomedical KBs.
  • Figure 3: Overview of a BELHD training step with candidate sharing.