GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition

Anthony Yazdani, Ihor Stepanov, Douglas Teodoro

TL;DR

GLiNER-BioMed tackles open biomedical NER by replacing fixed entity taxonomies with natural-language label descriptions, which enables zero-shot recognition. It distills the annotation capabilities of large language models (LLMs) into a compact model that generates a high-coverage synthetic biomedical pre-training dataset, then fine-tunes uni- and bi-encoder GLiNER architectures at multiple scales on a diverse post-training corpus. Across eight biomedical benchmarks, it achieves a zero-shot improvement of 5.96 F1 points over the strongest baseline and strong few-shot performance, with the bi-encoder variant delivering substantial throughput advantages in low-data scenarios. The work provides an open-source pipeline and data resource, enabling practical deployment while acknowledging synthetic-data biases and computational costs as areas for future improvement.

Abstract

Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types. To address these issues, we introduce GLiNER-BioMed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedicine. In contrast to conventional approaches, GLiNER uses natural language labels to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Experiments on several biomedical datasets demonstrate that GLiNER-BioMed outperforms the state of the art in both zero- and few-shot scenarios, achieving a 5.96% improvement in F1-score over the strongest baseline (p-value < 0.001). Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
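
As a rough illustration of how matching text spans against natural-language labels can support zero-shot recognition, the sketch below scores each candidate span against each entity-type label with a sigmoid over a dot product, the matching rule described for the original GLiNER model. The three-dimensional vectors are hand-written stand-ins for encoder outputs, and the function name `match_scores` and the 0.5 decision threshold are assumptions made for this example, not part of the released GLiNER-BioMed code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def match_scores(span_vecs, label_vecs):
    """Score every (span, label) pair as sigmoid(span . label)."""
    scores = {}
    for span, s in span_vecs.items():
        for label, q in label_vecs.items():
            dot = sum(a * b for a, b in zip(s, q))
            scores[(span, label)] = sigmoid(dot)
    return scores


# Hypothetical embeddings; a real model would produce these with its encoder(s).
span_vecs = {
    "aspirin": [2.0, -1.0, 0.0],
    "migraine": [-1.5, 2.5, 0.0],
}
label_vecs = {
    "drug": [1.5, -1.0, 0.0],
    "disease": [-1.0, 1.5, 0.0],
}

THRESHOLD = 0.5  # assumed decision threshold
scores = match_scores(span_vecs, label_vecs)
predictions = sorted(pair for pair, p in scores.items() if p > THRESHOLD)
print(predictions)
```

Because the labels enter only as vectors, adding a new entity type in this sketch means adding one more entry to `label_vecs`, with no change to the scoring code.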

Paper Structure

This paper contains 33 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the synthetic pre-training data generation pipeline for GLiNER-BioMed (Section \ref{pre_training_dataset}). The pipeline begins with data collection, involving corpus selection, quality filtering, deduplication, and stratified sampling to produce a 115k-passage corpus. This curated corpus is then annotated using a model distillation strategy in which OpenBioLLM-70B (teacher) annotates an initial 10k samples, which are used to train a smaller OpenBioLLM-8B (student) model with low-rank adaptation. The distilled student model then efficiently annotates the remaining 105k passages for the final pre-training dataset.
  • Figure 2: Peak inference throughput (words/second) for GLiNER-BioMed large models, illustrating architectural efficiency under varied entity type loads. Peak words/second is the maximum achieved over batch sizes from 1 to 64, or up to the largest batch size that fits in memory, on an NVIDIA RTX 3090 GPU using FP32 precision. Performance of uni-encoder (blue) and bi-encoder (orange) is compared using (1) dataset-specific entity labels (solid bars), and (2) a fixed set of 127 UMLS semantic types (hatched bars). Annotated percentages indicate the throughput difference between bi- and uni-encoders.
  • Figure S1: TF-IDF similarity graph for biomedical passages. Blue nodes represent retained representatives; red nodes indicate excluded duplicates.
  • Figure S2: UMLS semantic groups by mention count (log scale) in the synthetic pre-training dataset. Counts reflect total NER label–to–semantic group mappings; labels associated with multiple semantic groups contribute to multiple bars.
  • Figure S3: Top 15 WordNet lexnames by mention count (log scale) in the post-training dataset. Each mention is assigned a single lexname based on its most frequent WordNet sense.
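
The deduplication step shown in Figure S1 can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, smoothed TF-IDF weights, a cosine-similarity threshold of 0.8, and a rule that keeps the lowest-index passage in each connected group of near-duplicates; none of these choices is confirmed by the text above, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalized sparse TF-IDF vectors (dicts) for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency (an assumed weighting scheme).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def deduplicate(docs, threshold=0.8):
    """Return (kept, dropped) indices; threshold is an assumed value."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect passages whose similarity reaches the threshold (graph edges).
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # The root of each component is its lowest index: the retained representative.
    kept = sorted({find(i) for i in range(len(docs))})
    dropped = [i for i in range(len(docs)) if i not in kept]
    return kept, dropped


docs = [
    "aspirin reduces the risk of heart attack",
    "aspirin reduces the risk of heart attack",
    "metformin is a first line therapy for type 2 diabetes",
]
kept, dropped = deduplicate(docs)
print(kept, dropped)
```

In the terms of Figure S1, `kept` corresponds to the retained representatives and `dropped` to the excluded duplicates. The all-pairs comparison here is quadratic in the number of passages, so a corpus on the order of 115k passages would likely need sparse matrix operations or approximate nearest-neighbor search instead.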