Medical Concept Normalization in a Low-Resource Setting

Tim Patzelt

TL;DR

This work tackles medical concept normalization (MCN) in a low-resource setting by creating TLC-UMLS, a German lay-language dataset annotated with UMLS CUIs. It evaluates multilingual transformer approaches (notably SapBERTXLMR with Self-Alignment Pre-training) against a strong string-based baseline and finds that the embedding-based methods outperform the baseline, while contextual information, as implemented, did not improve results. A comprehensive error analysis identifies frequent misclassifications and error categories, pointing to potential improvements such as semantic-type conditioning and refined evaluation metrics. The study demonstrates the viability of multilingual MCN for German lay texts and provides a dataset and methodological insights for future cross-lingual, low-resource MCN research.

Abstract

In biomedical natural language processing, medical concept normalization is the task of mapping mentions of concepts to entries in a large knowledge base. The task becomes even more challenging in low-resource settings, where little data and few resources are available. In this thesis, I explore the challenges of medical concept normalization in such a setting. Specifically, I investigate the shortcomings of current medical concept normalization methods when applied to German lay texts. Since no suitable dataset is available, a dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System. The experiments demonstrate that multilingual Transformer-based models outperform string similarity methods. The use of contextual information to improve the normalization of lay mentions is also examined, but it led to inferior results. Based on the results of the best-performing model, I present a systematic error analysis and lay out potential improvements to mitigate frequent errors.

Paper Structure

This paper contains 57 sections, 12 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: The WordPiece tokenization process with an example sentence. Taken from https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html.
  • Figure 2: The Transformer architecture. Taken from vaswani_attention_2017.
  • Figure 3: The pre-training setup for BERT. The positions of the [CLS] and [SEP] tokens and both training objectives are shown. Taken from devlin_bert_2019.
  • Figure 4: Cosine similarity scores of pairs of embeddings produced by PubMedBERT. The examples were randomly sampled from UMLS. Positive pairs consist of names that belong to the same concept, and negative pairs consist of names from two different concepts. The left plot shows examples of embeddings that are already well separated in the embedding space. The right plot shows examples with a high overlap between positive and negative samples. These are the hard examples that are selected by hard pair mining. Taken from liu_self-alignment_2021.
  • Figure 5: The Sentence Cross-Encoder architecture with a BERT encoder to calculate similarity scores. The weights of the two BERT models are tied (siamese network). Taken from reimers_sentence-bert_2019.
  • ...and 9 more figures