UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition
Aidan Mannion, Thierry Chevalier, Didier Schwab, Lorraine Goeuriot
TL;DR
This work addresses the problem of injecting structured biomedical knowledge into transformer-based language models. It introduces UMLS-KGI-BERT, a data-centric pre-training framework that adds UMLS-derived text sequences and three knowledge-graph (KG) reasoning tasks (entity prediction, link prediction, and triple classification) to the standard masked-language objective, so that a model learns jointly from knowledge graphs and free text. Across French, Spanish, and English, the approach improves results on multiple biomedical NER benchmarks and is competitive with established baselines while requiring less data. The authors release model weights and a Python toolkit, which makes the approach practical for multilingual clinical NLP and offers a route to more knowledge-informed language models in biomedicine.
Abstract
Pre-trained transformer language models (LMs) have in recent years become the dominant paradigm in applied NLP. These models have achieved state-of-the-art performance on tasks such as information extraction, question answering, sentiment analysis, document classification and many others. In the biomedical domain, significant progress has been made in adapting this paradigm to NLP tasks that require the integration of domain-specific knowledge as well as statistical modelling of language. In particular, research in this area has focused on the question of how best to construct LMs that take into account not only the patterns of token distribution in medical text, but also the wealth of structured information contained in terminology resources such as the Unified Medical Language System (UMLS). This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS. This allows graph-based learning objectives to be combined with masked-language pre-training. Preliminary results from experiments on both extending pre-trained LMs and training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
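To make the idea of turning knowledge-graph triples into text sequences more concrete, the following is a minimal sketch of how training examples for the three KG tasks might be built. It is not the authors' toolkit: the function names, the `[SEP]`/`[MASK]` sequence layout, the tail-corruption strategy for negative examples, and the toy triples are all assumptions made for illustration.

```python
import random

# Special tokens in the style of BERT tokenizers (assumed layout, not from the paper).
MASK, SEP = "[MASK]", "[SEP]"

# Toy (head, relation, tail) triples, invented for illustration only.
TRIPLES = [
    ("aspirin", "may_treat", "headache"),
    ("insulin", "may_treat", "diabetes mellitus"),
    ("myocardium", "part_of", "heart"),
]


def entity_prediction_example(triple):
    """Hide the tail entity; the model has to recover it."""
    head, relation, tail = triple
    return {"text": f"{head} {SEP} {relation} {SEP} {MASK}", "label": tail}


def link_prediction_example(triple):
    """Show both entities; the model has to classify the relation."""
    head, relation, tail = triple
    return {"text": f"{head} {SEP} {tail}", "label": relation}


def triple_classification_examples(triples, rng):
    """Pair each true triple (label 1) with a corrupted copy (label 0).

    Negatives are made by swapping in a tail that does not form a known
    triple with the same head and relation (an assumed strategy).
    """
    known = set(triples)
    tails = sorted({tail for _, _, tail in triples})
    examples = []
    for head, relation, tail in triples:
        examples.append(
            {"text": f"{head} {SEP} {relation} {SEP} {tail}", "label": 1}
        )
        candidates = [t for t in tails if (head, relation, t) not in known]
        if candidates:
            fake_tail = rng.choice(candidates)
            examples.append(
                {"text": f"{head} {SEP} {relation} {SEP} {fake_tail}", "label": 0}
            )
    return examples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    print(entity_prediction_example(TRIPLES[0]))
    print(link_prediction_example(TRIPLES[0]))
    for example in triple_classification_examples(TRIPLES, rng):
        print(example)
```

Because every task is expressed as a plain text sequence with a label, examples like these could in principle be mixed into the same batches as ordinary masked-language data, which is the sense in which the framework is described as data-centric.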
