xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization

Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P. Schapranow

TL;DR

This work tackles cross-lingual medical entity normalization (MEN) in languages with varying resource availability. It introduces xMEN, a modular Python toolkit that combines unsupervised cross-lingual candidate generation (TF-IDF with character n-grams and SapBERT) with trainable cross-encoder re-ranking, augmented by a rank-regularization loss, and supports weakly supervised training via machine-translation–based label projection (EasyProject). The approach demonstrates state-of-the-art performance across multiple multilingual benchmarks, aided by a BigBIO-compatible data interface and reproducible evaluation pipelines. Practically, xMEN enables robust MEN across languages with limited supervision and provides a clear path for extending to additional KBs and languages while maintaining reproducibility.

Abstract

Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English.

Materials and Methods: We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available. We also evaluate cross-encoders trained in a weakly supervised manner based on machine-translated datasets from a high-resource domain. Our system is publicly available as an extensible Python toolkit.

Results: xMEN improves on state-of-the-art performance across a wide range of multilingual benchmark datasets. Weakly supervised cross-encoders are effective when no training data is available for the target task. Because xMEN is compatible with the BigBIO framework, it can be easily used with existing and prospective datasets.

Discussion: Our experiments show the importance of balancing the output of general-purpose candidate generators with subsequent trainable re-rankers, which we achieve through a rank regularization term in the loss function of the cross-encoder. However, error analysis reveals that multi-word expressions and other complex entities are still challenging.

Conclusion: xMEN exhibits strong performance for medical entity normalization in multiple languages, even when no labeled data and few terminology aliases for the target language are available. Its configuration system and evaluation modules enable reproducible benchmarks. Models and code are available online at the following URL: https://github.com/hpi-dhc/xmen
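To make the idea of cross-lingual candidate generation over English aliases more concrete, the following is a minimal sketch of TF-IDF retrieval over character n-grams, written with the Python standard library only. It does not use or reproduce the xMEN API: the function names, the trigram size, the smoothed IDF formula, the placeholder concept IDs, and the toy alias list are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty vectors stay empty."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(aliases, n=3):
    """Compute smoothed IDF weights and one unit-length TF-IDF vector per alias.

    aliases: list of (concept_id, alias_string) pairs.
    """
    docs = [Counter(char_ngrams(alias, n)) for _, alias in aliases]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors


def generate_candidates(mention, aliases, idf, vectors, k=3, n=3):
    """Return up to k (concept_id, score) pairs, keeping the best alias per concept."""
    counts = Counter(char_ngrams(mention, n))
    # n-grams never seen in the alias index carry no weight and are dropped.
    query = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    best = {}
    for (concept_id, _), vec in zip(aliases, vectors):
        score = sum(w * vec.get(g, 0.0) for g, w in query.items())
        if score > best.get(concept_id, 0.0):
            best[concept_id] = score
    return sorted(best.items(), key=lambda item: -item[1])[:k]


# Toy English alias list with placeholder concept IDs (not real KB identifiers).
ALIASES = [
    ("C_LUPUS", "lupus erythematosus"),
    ("C_LUPUS", "systemic lupus erythematosus"),
    ("C_ERYTHEMA", "erythema"),
    ("C_DIABETES", "diabetes mellitus"),
]

idf, vectors = build_index(ALIASES)
# A German surface form is matched against English aliases via shared n-grams.
candidates = generate_candidates("Lupus erythematodes", ALIASES, idf, vectors)
print(candidates)
```

Character n-grams let a mention in one language overlap with aliases in another when the terms share a Latin or Greek root, which is common in medical vocabulary. The resulting candidate list would then be passed to a re-ranker such as the cross-encoder described in the abstract.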

Paper Structure

This paper contains 31 sections, 2 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Overview of the modular architecture of xMEN, along with a normalization example of the mention "lupus". xMEN can be used with any BigBIO-compatible dataset and implements different approaches for candidate generation and ranking, pre- and post-processing steps, evaluation metrics and utilities for error analysis. In addition, different knowledge bases can be quickly integrated and indexed via the configuration system and command line interface (CLI).
  • Figure 2: Impact of the relative weight $\lambda$ of the rank regularization term in the CE loss function. We report the test set recall@$k$ for different values of $k$ in a) for Quaero and b) for DisTEMIST. For each value of $\lambda$, we report the mean and standard deviation across three runs with different random seeds. Note that the y-axes have different intervals, as the baseline performance is higher for Quaero.
  • Figure 3: Impact of mention length and lexical ambiguity on the absolute number of true positives (for $k=1$) before and after re-ranking with the fully supervised CE. The "Total" line refers to the total number of concepts in the gold standard. The number of shared aliases in the right column is the maximum number of aliases that any concept in the candidate lists shares with the ground truth concept. When zero aliases are shared, the correct concept was not among the retrieved candidates, so the number of true positives is also zero, both before and after re-ranking.
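The recall@$k$ metric reported in Figure 2, and the observation in Figure 3 that re-ranking cannot recover a concept missing from the candidate list, can be illustrated with a short sketch. The function name, the toy data, and the assumption of exactly one gold concept per mention are illustrative choices, not taken from the xMEN evaluation code.

```python
def recall_at_k(gold, ranked, k):
    """Fraction of mentions whose gold concept is among the top-k candidates.

    gold:   list with one gold concept ID per mention (simplifying assumption).
    ranked: list of candidate ID lists, best candidate first, aligned with gold.
    """
    if not gold:
        return 0.0
    hits = sum(1 for g, cands in zip(gold, ranked) if g in cands[:k])
    return hits / len(gold)


# Toy example with placeholder concept IDs.
gold = ["A", "B", "C"]

# Candidate generator output: "C" is never retrieved for the third mention.
generated = [["A", "X"], ["X", "B"], ["X", "Y"]]

# A re-ranker only permutes each list; it cannot add missing concepts.
reranked = [["A", "X"], ["B", "X"], ["Y", "X"]]

for name, lists in [("generated", generated), ("reranked", reranked)]:
    print(name, [round(recall_at_k(gold, lists, k), 3) for k in (1, 2)])
```

In this toy setting, re-ranking raises recall@1 by moving "B" to the top, while recall at the full list length stays the same before and after re-ranking. That ceiling is set by the candidate generator.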