ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish

Fernando Gallego, Guillermo López-García, Luis Gasco-Sánchez, Martin Krallinger, Francisco J. Veredas

TL;DR

ClinLinker tackles Spanish medical entity linking (MEL) with a two-stage pipeline: a Spanish SapBERT-based bi-encoder retrieves candidate concepts, and a cross-encoder then re-ranks them, with training restricted to Spanish UMLS concepts (including SNOMED-CT). The approach uses corpus-specific data (DisTEMIST and MedProcNER) and FAISS-based nearest-neighbor retrieval to outperform multilingual baselines on disease and procedure mentions, with substantial gains on both the gold-standard and unseen-code subsets. The results show the importance of language-specific adaptation for clinical NLP and strong generalization to new codes, with potential to scale to broader clinical variables and to integrate with knowledge graphs. Overall, ClinLinker establishes a new benchmark for Spanish MEL and offers a practical pathway for structuring large-scale EHR-derived data in non-English clinical domains.

Abstract

Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly advanced clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained with a contrastive-learning strategy tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforms multilingual language models designed for the same purpose. This holds even in complex scenarios involving heterogeneous medical terminologies and when the models are trained on only a subset of the original data. Our results, evaluated using top-k accuracy at 25 and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value both for large-scale automatic generation of structured data derived from clinical records and for exhaustive extraction and harmonization of predefined clinical variables of interest.
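
The retrieve-then-re-rank idea described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the character n-gram `embed` function stands in for the SapBERT-based bi-encoder, the brute-force loop in `retrieve` stands in for a nearest-neighbour index, and the token-overlap `pair_score` stands in for the cross-encoder. All function names and the `CODE_*` identifiers are hypothetical placeholders, not real SNOMED-CT codes.

```python
import math
from collections import Counter

# Toy terminology: placeholder codes (NOT real SNOMED-CT identifiers) mapped to Spanish terms.
TERMINOLOGY = {
    "CODE_A": "infarto agudo de miocardio",
    "CODE_B": "diabetes mellitus tipo 2",
    "CODE_C": "neumonía bacteriana",
    "CODE_D": "infarto cerebral",
}


def embed(text, n=3):
    """Stand-in for the bi-encoder: a bag of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(mention, k):
    """Stage 1: embed the mention alone and keep the k most similar terms.

    The exhaustive loop is an assumption made for brevity; a vector index
    would replace it for a large terminology.
    """
    query = embed(mention)
    scored = [(cosine(query, embed(term)), code) for code, term in TERMINOLOGY.items()]
    scored.sort(reverse=True)
    return [code for _, code in scored[:k]]


def pair_score(mention, term):
    """Stand-in for the cross-encoder: scores the (mention, term) pair jointly."""
    m, t = set(mention.lower().split()), set(term.lower().split())
    return len(m & t) / len(m | t) if m | t else 0.0


def link(mention, k=3):
    """Stage 2: re-rank only the k retrieved candidates with the pairwise scorer."""
    candidates = retrieve(mention, k)
    return sorted(candidates, key=lambda c: pair_score(mention, TERMINOLOGY[c]), reverse=True)


ranking = link("infarto de miocardio")
print(ranking)
```

The design point the sketch tries to capture is the division of labour: the first stage scores every concept cheaply and independently of the mention, while the second stage applies a more expensive joint scorer to a short candidate list only.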

Paper Structure (9 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: DisTEMIST-linking subtrack: requires automatically finding disease mentions in published clinical cases and assigning, to each mention, a SNOMED-CT term.
  • Figure 2: ClinLinker's two-stage pipeline for MEL: a first stage of candidate retrieval, using a bi-encoder, and a subsequent stage of re-ranking, employing a cross-encoder.
  • Figure 3: Performance comparison of the bi-encoder and bi-encoder+cross-encoder ("_CE") models on the DisTEMIST corpus: top-k accuracy of the models across various retrieval thresholds for both the validated gold-standard annotations and the unseen-codes subset. (Note that the two plots use different scales to show the differences in model performance on each subset.)
  • Figure 4: Performance comparison of the bi-encoder and bi-encoder+cross-encoder ("_CE") models on the MedProcNER corpus: top-k accuracy of the models across various retrieval thresholds for both the validated gold-standard annotations and the unseen-codes subset. (Note that the two plots use different scales to show the differences in model performance on each subset.)
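
Figures 3 and 4 report top-k accuracy at several retrieval thresholds on two subsets. A minimal Python sketch of how such a metric could be computed follows. It assumes that a prediction counts as correct at threshold k when the gold code appears among the first k ranked candidates, and that the "unseen codes" subset consists of test mentions whose gold code never occurs in the training annotations. The function names and the single-letter codes are hypothetical placeholders, and the numbers are toy data, not results from the paper.

```python
def top_k_accuracy(rankings, gold_codes, k):
    """Fraction of mentions whose gold code is among the first k ranked candidates."""
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, gold in zip(rankings, gold_codes) if gold in ranked[:k])
    return hits / len(rankings)


def unseen_subset(rankings, gold_codes, train_codes):
    """Keep only mentions whose gold code is absent from the training codes (assumed definition)."""
    kept = [(r, g) for r, g in zip(rankings, gold_codes) if g not in train_codes]
    return [r for r, _ in kept], [g for _, g in kept]


# Toy data: one ranked candidate list per mention, plus the gold code for each mention.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"], ["D", "C", "A"]]
gold_codes = ["A", "A", "A", "C"]
train_codes = {"A"}  # codes assumed to have been seen during training

full_curve = {k: top_k_accuracy(rankings, gold_codes, k) for k in (1, 2, 3)}
unseen_rankings, unseen_gold = unseen_subset(rankings, gold_codes, train_codes)
unseen_curve = {k: top_k_accuracy(unseen_rankings, unseen_gold, k) for k in (1, 2, 3)}

print(full_curve)
print(unseen_curve)
```

Evaluating the same function at increasing k gives one curve per model and per subset, which is the kind of comparison the two figures plot.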