Do LLMs Surpass Encoders for Biomedical NER?

Motasem S Obeidat, Md Sultan Al Nahian, Ramakanth Kavuluru

TL;DR

This study rigorously compares encoder-based models (BERT and variants) with decoder-based LLMs (Mistral and Llama) for biomedical NER under the BIO tagging scheme across five datasets with varying proportions of long entities. The analysis demonstrates that LLMs often yield higher F1 scores, especially for longer spans, but incur significantly higher inference times and hardware costs, potentially limiting practical use. While encoders perform strongly on short entities and in low-latency contexts, LLMs offer robust performance on complex, multi-token entities and may be preferable when marginal gains justify the expense. The findings guide practitioners in choosing between encoder and decoder architectures and suggest potential hybrid or distilled approaches for scalable, high-performance biomedical NER.

Abstract

Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. The state of the art in NER has shifted from traditional ML models to deep neural networks, with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE. LLM-driven NER, though, often ignores positional information due to the generative nature of decoder models. Furthermore, LLMs are computationally very expensive (both in inference time and hardware needs). Hence, it is worth exploring whether they actually excel at biomedical NER and assessing any associated trade-offs (performance vs. efficiency). This is exactly what we do in this effort, employing the same BIO entity tagging scheme (which retains positional information) for both model types on five different datasets with varying proportions of longer entities. Our results show that the LLMs chosen (Mistral and Llama: 8B range) often outperform the best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3: 300M range) by 2-8% in F-scores, except for one dataset, where they equal encoder performance. This gain is more prominent among longer entities of length >= 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost-prohibitive hardware. Thus, when performance differences are small or real-time user feedback is needed, encoder models might still be more suitable than LLMs.
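
To make the BIO scheme and the "longer entities" comparison concrete, here is a minimal sketch of how BIO tags map to positioned entity spans and how a span-level F-score could be restricted to entities of at least 3 tokens. It is an illustration, not the paper's evaluation code. The strict exact-match criterion, the function names `bio_to_spans` and `span_f1`, the example sentence, and the choice to measure length in whitespace tokens are all assumptions made for this sketch.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a BIO tag sequence into typed spans with token positions."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        # The open span continues only on an "I-" tag of the same type.
        continues = etype is not None and tag == "I-" + etype
        if not continues:
            if etype is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            # Assumption: a stray "I-" with no open span is ignored.
    return spans


def span_f1(gold: List[str], pred: List[str], min_len: int = 1) -> float:
    """Exact-match span F1 over entities with at least `min_len` tokens."""
    g = {s for s in bio_to_spans(gold) if s[1] - s[0] >= min_len}
    p = {s for s in bio_to_spans(pred) if s[1] - s[0] >= min_len}
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence with one 4-token entity and one 1-token entity.
tokens = ["Mutations", "in", "tumor", "necrosis", "factor", "alpha",
          "affect", "IL-2", "expression"]
gold = ["O", "O", "B-protein", "I-protein", "I-protein", "I-protein",
        "O", "B-protein", "O"]
# The prediction truncates the long entity by one token.
pred = ["O", "O", "B-protein", "I-protein", "I-protein", "O",
        "O", "B-protein", "O"]

print(bio_to_spans(gold))
print(span_f1(gold, pred))             # all entities
print(span_f1(gold, pred, min_len=3))  # long entities only
```

In this toy case the short entity is matched but the long one is cut off by one token, so the long-entity score drops to zero while the overall score does not. That is the kind of difference a length-stratified comparison is meant to expose.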

Paper Structure

This paper contains 12 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Sample prompt for the JNLPBA dataset for LLM-driven NER