On-the-fly Definition Augmentation of LLMs for Biomedical NER

Monica Munnangi, Sergey Feldman, Byron C Wallace, Silvio Amir, Tom Hope, Aakanksha Naik

TL;DR

This work tackles the limited performance of large language models on biomedical NER by introducing on-the-fly definition augmentation, which injects definitional knowledge for relevant concepts at inference time to enable revision of extractions. The authors systematically compare zero-shot and few-shot prompting strategies across six BigBIO datasets, using both open and closed LLMs, and find that definition augmentation can yield substantial gains, particularly for Llama 2 and GPT-4, with domain-specific sources like UMLS providing the strongest improvements. They validate the approach through extensive ablations, showing that gains hinge on the relevance of the provided definitions and that simple integration with a strong knowledge base can outperform a baseline fine-tuned model in few-shot settings. The work contributes a practical, reproducible framework for rapid domain adaptation of LLMs to biomedical NER and suggests that carefully curated definitional knowledge can generalize to other domains with limited labeled data.

Abstract

Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open-source and closed LLMs. For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all six of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at https://github.com/allenai/beacon.
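
The augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `build_revision_prompt`, the prompt wording, and the toy definition dictionary (standing in for a knowledge source such as UMLS) are all assumptions made for illustration. The sketch only shows the general idea: take a first round of extracted entities, look up a definition for each one, and assemble a follow-up prompt asking the model to revise its extractions.

```python
from typing import Dict, List


def build_revision_prompt(
    text: str,
    entities: List[str],
    definitions: Dict[str, str],
) -> str:
    """Assemble a follow-up prompt that shows definitions for extracted entities.

    `definitions` is a hypothetical lookup table keyed by lowercased entity
    string; a real system would query a knowledge base instead.
    """
    lines = [f"Text: {text}", "", "Previously extracted entities:"]
    for entity in entities:
        lines.append(f"- {entity}")

    lines.append("")
    lines.append("Definitions of relevant concepts:")
    for entity in entities:
        definition = definitions.get(entity.lower())
        # Entities without a known definition are left undefined, not guessed.
        if definition is not None:
            lines.append(f"- {entity}: {definition}")

    lines.append("")
    lines.append(
        "Using these definitions, revise the entity list "
        "(add, remove, or relabel entities) and return it as JSON."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    toy_definitions = {"aspirin": "A drug used to reduce pain and fever."}
    print(
        build_revision_prompt(
            "Patients received aspirin.",
            ["aspirin", "patients"],
            toy_definitions,
        )
    )
```

In this sketch, an entity with no available definition still appears in the extraction list but gets no definition line, so the model is never shown invented definitional text.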

Paper Structure (30 sections, 13 figures, 17 tables)

Figures (13)

  • Figure 1: Illustration of our approach using a zero-shot example, with incorrect extraction (red) and correct extraction (green) when provided with the definition of the extracted entity (yellow).
  • Figure 2: Definition relevance ablations with GPT-4 on the CDR (top-left) and NCBI (top-right) datasets, and with Llama 2 on the MEDM (bottom-left) and CHIA (bottom-right) datasets. Trends are similar across all models and datasets: performance decreases consistently as the definitions become less relevant.
  • Figure 3: F1 score plotted against the number of shots in the few-shot setting. Performance of all models tends to increase with the number of shots (except on the NCBI and MEDM datasets, where we observe minor fluctuations in performance).
  • Figure 4: Zero-shot Prompt with text input and JSON output
  • Figure 5: Zero-shot Prompt with schema def input and JSON output
  • ...and 8 more figures