LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Aishik Nagar, Viktor Schlegel, Thanh-Tung Nguyen, Hao Li, Yuping Wu, Kuluhan Binici, Stefan Winkler

TL;DR

The paper evaluates the true zero-shot capabilities of LLMs for biomedical structured prediction, specifically classification and NER, where annotated data is scarce. It systematically compares vanilla prompting, Chain-of-Thought (CoT), self-consistency (SC), and Retrieval-Augmented Generation (RAG, with PubMed and Wikipedia corpora), all using constrained decoding on open-source models (BioMistral, Llama-2 variants), across 14 classification and 12 NER datasets, including a two-stage NER pipeline. The key findings are that standard prompting consistently outperforms the advanced prompting and retrieval techniques, that model size is the main driver of zero-shot performance, and that CoT, SC, and RAG do not reliably improve results for structured biomedical outputs. This highlights the need for more effective grounding and integration of external knowledge into LLMs to support reliable biomedical information extraction in real-world settings.

Abstract

Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs, including BioMistral and Llama-2 models, on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency-based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counterintuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.

Paper Structure (18 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Best-performing Standard Prompting method for BioMistral 7B, Llama-2 70B and Llama-2 7B for all classification tasks.
  • Figure 2: Best-performing Standard Prompting method for BioMistral 7B, Llama-2 70B and Llama-2 7B for all NER tasks.
  • Figure 3: Performance comparison for BioMistral 7B, Llama 7B and Llama 70B on single- and multi-label datasets, with random guess baselines of 0.415 and 0.215, respectively.
  • Figure 4: Breakdown of the Micro-F1 performance of each technique on all classification datasets, compared against the random guess baseline.
  • Figure 5: Breakdown of the Micro-F1 performance of each technique and the random guess baseline on all NER datasets. A prediction is counted as correct when both the span and its assigned label are found in the ground truth.
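
The correctness criterion in the Figure 5 caption can be illustrated with a minimal sketch of Micro-F1 over (span, label) pairs. This is not the authors' evaluation code: the function name `micro_f1`, the representation of an entity as a `(span_text, label)` tuple, and the example labels are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# Assumption: an entity is a (span_text, label) pair; the paper may
# instead use character or token offsets to identify spans.
Entity = Tuple[str, str]


def micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Micro-averaged F1 over all documents.

    A predicted entity is a true positive only if the same span with the
    same label appears in that document's gold annotations.
    """
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(gold, pred):
        tp += len(gold_doc & pred_doc)   # span and label both match
        fp += len(pred_doc - gold_doc)   # predicted but not in gold
        fn += len(gold_doc - pred_doc)   # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second prediction has the right span but the
# wrong label, so it counts as one false positive and one false negative.
gold = [{("aspirin", "Chemical"), ("headache", "Disease")}]
pred = [{("aspirin", "Chemical"), ("headache", "Chemical")}]
print(micro_f1(gold, pred))
```

Because counts are pooled across all documents before precision and recall are computed, datasets with many entities per document weigh more heavily than under a per-document (macro) average.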