On-the-fly Definition Augmentation of LLMs for Biomedical NER
Monica Munnangi, Sergey Feldman, Byron C Wallace, Silvio Amir, Tom Hope, Aakanksha Naik
TL;DR
This work tackles the limited performance of large language models on biomedical NER by introducing on-the-fly definition augmentation, which injects definitions of relevant concepts at inference time so that the model can revise its extractions. The authors systematically compare zero-shot and few-shot prompting strategies across six BigBIO datasets, using both open and closed LLMs, and find that definition augmentation can yield substantial gains, particularly for Llama 2 and GPT-4, with domain-specific sources like UMLS providing the strongest improvements. Ablations show that the gains hinge on the relevance of the provided definitions, and that careful prompting strategies allow LLMs to outperform a fine-tuned baseline model in few-shot settings. The work contributes a practical, reproducible framework for adapting LLMs to biomedical NER with limited labeled data, and the authors suggest that definitional knowledge may help in other domains where labeled data is scarce.
Abstract
Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and a lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. As part of this effort, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open-source and closed LLMs. For example, it leads to an average relative improvement of 15% in GPT-4 performance (F1) across all six of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at https://github.com/allenai/beacon.
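
To make the idea of on-the-fly definition augmentation concrete, the following is a minimal sketch of how a revision prompt might be assembled. It is not the authors' implementation: the function names (`find_defined_concepts`, `build_revision_prompt`), the toy definition dictionary, the entity labels, and the prompt wording are all assumptions made for illustration. A real system would look up definitions in a knowledge base and send the resulting prompt to an LLM.

```python
import re

# Toy stand-in for a knowledge base. These entries are illustrative only
# and are not taken from the paper or from any specific resource.
TOY_DEFINITIONS = {
    "aspirin": "A nonsteroidal anti-inflammatory drug used to reduce pain, fever, and inflammation.",
    "myocardial infarction": "Death of heart muscle tissue caused by loss of blood supply.",
}


def find_defined_concepts(text, definitions):
    """Return (term, definition) pairs for dictionary terms that occur in the text."""
    lowered = text.lower()
    found = []
    for term, definition in definitions.items():
        # Whole-word, case-insensitive match of the dictionary term.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((term, definition))
    return found


def build_revision_prompt(text, initial_entities, definitions):
    """Assemble a follow-up prompt asking a model to revise its first-pass extractions.

    initial_entities is a list of (mention, label) pairs from a first pass.
    """
    concepts = find_defined_concepts(text, definitions)
    lines = ["Input text:", text, "", "Initial extractions:"]
    lines += [f"- {mention} ({label})" for mention, label in initial_entities]
    lines += ["", "Definitions of relevant concepts:"]
    lines += [f"- {term}: {definition}" for term, definition in concepts]
    lines += [
        "",
        "Using the definitions above, revise the extractions: "
        "add missing entities, remove incorrect ones, and fix labels.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sentence = "The patient took aspirin after a myocardial infarction."
    first_pass = [("aspirin", "Chemical")]  # hypothetical first-pass output
    print(build_revision_prompt(sentence, first_pass, TOY_DEFINITIONS))
```

The sketch keeps definition lookup separate from prompt construction, so the toy dictionary could be swapped for a lookup against a resource such as UMLS without changing the prompt format.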
