
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks

Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang

TL;DR

On biomedical datasets with smaller training sets, zero-shot LLMs outperform even the current state-of-the-art models fine-tuned only on the training sets of those datasets, suggesting that pre-training on large text corpora makes LLMs quite specialized even in the biomedical domain.

Abstract

Recently, Large Language Models (LLMs) have demonstrated an impressive capability to solve a wide range of tasks. However, despite this success, no prior work has investigated their capability in the biomedical domain. To this end, this paper evaluates the performance of LLMs on benchmark biomedical tasks. We conduct a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot LLMs outperform even the current state-of-the-art fine-tuned biomedical models. This suggests that pretraining on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM outperforms the others on all tasks; the performance of each LLM varies depending on the task. While their performance is still quite poor in comparison to biomedical models fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for biomedical tasks that lack large annotated datasets.

Paper Structure (47 sections, 1 equation, 3 figures, 18 tables)

Figures (3)

  • Figure 1: An overview of our methodology to evaluate 6 biomedical tasks across 26 datasets in this paper. At first, we construct the prompt for each dataset. Then, we generate the response for each dataset using respective LLMs. Finally, depending on the task, we apply various evaluation techniques.
  • Figure 2: Confusion Matrix for different models in the PubMedQA dataset.
  • Figure 3: Confusion Matrix for different models in the MediQA-2019 dataset.
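
The three steps in the Figure 1 caption (construct a prompt, generate a response, evaluate) and the confusion matrices of Figures 2 and 3 can be illustrated with a minimal sketch for a PubMedQA-style yes/no/maybe task. Everything named here is an assumption made for illustration and is not taken from the paper: the prompt wording, the `build_prompt`, `parse_label`, and `confusion_matrix` helpers, and the `fake_llm` stand-in that replaces a real model call.

```python
from collections import Counter

# PubMedQA answers are one of these three labels.
LABELS = ("yes", "no", "maybe")


def build_prompt(question, context):
    """Step 1: construct a zero-shot prompt (wording is hypothetical)."""
    return (
        "Answer the question using the context. "
        "Reply with one word: yes, no, or maybe.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )


def fake_llm(prompt):
    """Step 2 stand-in: a real evaluation would call an LLM here."""
    return "Yes, the context supports this."


def parse_label(response):
    """Step 3a: map a free-text response to the first label word it contains.

    This is a naive rule; it returns None when no label word is found.
    """
    cleaned = response.lower().replace(".", " ").replace(",", " ")
    for token in cleaned.split():
        if token in LABELS:
            return token
    return None


def confusion_matrix(gold, predicted):
    """Step 3b: count (gold, predicted) pairs; None marks unparsable output."""
    counts = Counter(zip(gold, predicted))
    columns = LABELS + (None,)
    return {g: {p: counts[(g, p)] for p in columns} for g in LABELS}


if __name__ == "__main__":
    # Placeholder examples, not drawn from any real dataset.
    examples = [
        ("Placeholder question 1?", "Placeholder context 1.", "yes"),
        ("Placeholder question 2?", "Placeholder context 2.", "no"),
    ]
    gold = [label for _, _, label in examples]
    predicted = [parse_label(fake_llm(build_prompt(q, c))) for q, c, _ in examples]
    for gold_label, row in confusion_matrix(gold, predicted).items():
        print(gold_label, row)
```

Keeping a separate `None` column makes responses that contain no recognizable label visible in the matrix, rather than silently folding them into one of the classes.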