A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

Yanis Labrak, Mickael Rouvier, Richard Dufour

TL;DR

This work assesses how well instruction-tuned large language models perform a wide range of English clinical and biomedical NLP tasks in zero- and few-shot settings, comparing them against a domain-specific baseline (PubMedBERT). By introducing a semantic-prompt retrieval strategy and a Recursive Chain-of-Thought approach for named-entity recognition (NER), the study shows that ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca can approach state-of-the-art performance on many tasks, especially question answering (QA), while remaining behind task-specific models on classification (CLS) and relation extraction (RE). The results point to a practical path for medical NLP: instruction-tuned LLMs can deliver strong zero- and few-shot performance, with additional gains from task-tailored prompting and reasoning strategies, though cost and privacy considerations favor domain-specific BERT-style models in some settings. The work also offers insights into prompt design, evaluation of generative outputs, and task coverage across clinical and biomedical NLP, illustrating the trade-offs between generic LLMs and specialized medical models for real-world deployment.

Abstract

We evaluate four state-of-the-art instruction-tuned large language models (LLMs), namely ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca, on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, including named-entity recognition (NER), question answering (QA), and relation extraction (RE). Our overall results demonstrate that the evaluated LLMs begin to approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and do particularly well on the QA task, even though they have never seen examples from these tasks before. However, we observed that performance on the classification and RE tasks falls below what can be achieved with a model trained specifically for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited to certain tasks than others.

Paper Structure (43 sections, 13 tables)