Boosting Biomedical Concept Extraction by Rule-Based Data Augmentation

Qiwei Shao, Fengran Mo, Jian-Yun Nie

TL;DR

This work tackles the scarcity of labeled data and non-canonical naming in document-level biomedical concept extraction by leveraging MetaMapLite to generate pseudo-annotated data. It introduces a retrieval-based augmentation pipeline with candidate document retrieval, rule-based pseudo-annotation, and post-annotation filters, combined with a bi-encoder trained via contrastive learning on mixed manual and augmented data using an InfoNCE objective. Results on BC5CDR and NCBI-Disease show improved performance, especially for rare and non-canonical concepts, with insights on how augmentation quantity, annotation quality, and filtering affect outcomes; the approach also demonstrates complementary gains when combined with SapBERT. Overall, the method provides a practical data-augmentation pathway that enhances context-aware concept extraction in biomedical texts, with implications for scaling concept coverage in clinical and literature search applications.
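As an illustration of the InfoNCE objective mentioned above, the following is a minimal sketch for a single document embedding scored against one positive concept embedding and several negatives. The function name `info_nce`, the cosine-similarity scoring, and the temperature value are assumptions made for this example; the summary does not specify the paper's actual similarity function or hyperparameters.

```python
import math


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query against one positive and a list of negatives.

    Scoring by cosine similarity and the default temperature are assumptions
    for this sketch, not details taken from the paper.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [a / norm for a in v]

    q = normalize(query)
    # The positive is placed at index 0, followed by the negatives.
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))

    # Negative log-probability of the positive under a softmax over candidates.
    return log_denominator - logits[0]


# Toy 2-d embeddings: the document is aligned with the first concept.
doc = [1.0, 0.0]
aligned = info_nce(doc, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
misaligned = info_nce(doc, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

In this toy setting the loss is near zero when the positive concept is the one closest to the document and grows large when a negative is closer, which is the behaviour a bi-encoder is trained to exploit.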

Abstract

Document-level biomedical concept extraction is the task of identifying biomedical concepts mentioned in a given document. Recent advancements have adapted pre-trained language models for this task. However, the scarcity of domain-specific data and the deviation of concepts from their canonical names often hinder these models' effectiveness. To tackle this issue, we employ MetaMapLite, an existing rule-based concept mapping system, to generate additional pseudo-annotated data from PubMed and PMC. The annotated data are used to augment the limited training data. Through extensive experiments, this study demonstrates the utility of a manually crafted concept mapping tool for training a better concept extraction model.
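The augmentation idea in the abstract can be sketched as follows. This is a toy illustration only: `pseudo_annotate` is a hypothetical case-insensitive substring matcher standing in for a rule-based annotator and does not call MetaMapLite; the lexicon, the concept IDs `C1` and `C2`, and the filter rule (keep a candidate document only if it mentions a target concept) are all assumptions made for the example.

```python
def pseudo_annotate(text, lexicon):
    """Hypothetical stand-in for a rule-based annotator.

    Returns the set of concept IDs whose names occur in the text
    (case-insensitive substring match). This is NOT MetaMapLite.
    """
    lowered = text.lower()
    return {cid for name, cid in lexicon.items() if name.lower() in lowered}


def augment(manual, candidates, lexicon, target_ids):
    """Mix manually annotated documents with filtered pseudo-annotated ones."""
    augmented = []
    for doc in candidates:
        ids = pseudo_annotate(doc, lexicon)
        # Assumed post-annotation filter: keep documents that mention
        # at least one concept we want more training data for.
        if ids & target_ids:
            augmented.append((doc, ids, "pseudo"))
    return [(doc, ids, "manual") for doc, ids in manual] + augmented


# Placeholder lexicon and IDs, invented for this example.
lexicon = {"hepatic copper accumulation": "C1", "anemia": "C2"}
manual = [("Patient presented with anemia.", {"C2"})]
candidates = [
    "Hepatic copper accumulation was observed in all cases.",
    "No relevant findings were reported.",
]
training_set = augment(manual, candidates, lexicon, target_ids={"C1"})
```

Here the first candidate document is kept with the pseudo-label `C1`, the second is filtered out, and the manual example is carried over unchanged, so the concept with little manual data gains an extra training document.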

Paper Structure (31 sections, 6 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Concept extraction model with data augmentation. The concept Hepatic copper accumulation receives additional training data, which helps the model extract this concept during inference.
  • Figure 2: F1-score with different entity occurrence threshold $k$ on NCBI-Disease (a) and BC5CDR (b). The solid curves show the F1-scores with different augmentation quantities, and the dotted lines show the baseline F1-scores without augmentation.
  • Figure 3: F1-score with different weights on augmented data on NCBI-Disease (a) and BC5CDR (b).
  • Figure 4: F1-score of models with different filters on NCBI-Disease (a) and BC5CDR (b).
  • Figure 5: Example of non-canonical concept prediction output. Underscored text indicates canonical concept mentions. Italic text indicates non-canonical concept mentions. Bracketed text indicates concept IDs.