Boosting Biomedical Concept Extraction by Rule-Based Data Augmentation
Qiwei Shao, Fengran Mo, Jian-Yun Nie
TL;DR
This work tackles the scarcity of labeled data and non-canonical naming in document-level biomedical concept extraction by leveraging MetaMapLite to generate pseudo-annotated data. It introduces a retrieval-based augmentation pipeline with candidate document retrieval, rule-based pseudo-annotation, and post-annotation filters, combined with a bi-encoder trained via contrastive learning on mixed manual and augmented data using an InfoNCE objective. Results on BC5CDR and NCBI-Disease show improved performance, especially for rare and non-canonical concepts, with insights on how augmentation quantity, annotation quality, and filtering affect outcomes; the approach also demonstrates complementary gains when combined with SapBERT. Overall, the method provides a practical data-augmentation pathway that enhances context-aware concept extraction in biomedical texts, with implications for scaling concept coverage in clinical and literature search applications.
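To make the InfoNCE objective mentioned above concrete, the following is a minimal sketch of the loss for a single document scored against a set of candidate concepts. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce_loss`, the dot-product similarity, the temperature value, and the use of plain Python lists in place of encoder outputs are all hypothetical choices made here.

```python
import math

def info_nce_loss(doc_vec, concept_vecs, positive_index, temperature=0.05):
    """InfoNCE loss for one document embedding against candidate concept embeddings.

    Assumptions (not from the paper): dot-product similarity, a fixed
    temperature, and exactly one positive concept per call.
    """
    # Similarity between the document and each candidate concept, scaled by temperature.
    logits = [
        sum(d * c for d, c in zip(doc_vec, vec)) / temperature
        for vec in concept_vecs
    ]
    # Log-sum-exp over all candidates, subtracting the max for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # Negative log-probability of the positive concept under the softmax.
    return log_denominator - logits[positive_index]


# Toy vectors standing in for bi-encoder outputs.
doc = [1.0, 0.0]
concepts = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(doc, concepts, positive_index=0)
misaligned = info_nce_loss(doc, concepts, positive_index=1)
```

In this toy setting the loss is close to zero when the positive concept is the one most similar to the document, and large otherwise. In a setup like the one summarized above, the positives would come from either manual or pseudo-annotated documents.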
Abstract
Document-level biomedical concept extraction is the task of identifying biomedical concepts mentioned in a given document. Recent advancements have adapted pre-trained language models for this task. However, the scarcity of domain-specific data and the deviation of concept mentions from their canonical names often hinder these models' effectiveness. To address these issues, we employ MetaMapLite, an existing rule-based concept mapping system, to generate additional pseudo-annotated data from PubMed and PMC. The pseudo-annotated data are used to augment the limited training data. Through extensive experiments, this study demonstrates the utility of a manually crafted concept mapping tool for training a better concept extraction model.
