Benchmarking Retrieval-Augmented Large Language Models in Biomedical NLP: Application, Robustness, and Self-Awareness

Mingchen Li; Zaifu Zhan; Han Yang; Yongkang Xiao; Jiatan Huang; Rui Zhang

TL;DR

This work introduces BioRAB, a comprehensive framework for evaluating retrieval-augmented large language models (RALs) in the biomedical domain. It defines four testbeds (Unlabeled Robustness, Counterfactual Robustness, Diverse Robustness, and Negative Awareness) to probe robustness and self-awareness across five biomedical NLP tasks and 9 datasets, using multiple LLMs and retrievers. The authors show that RALs generally outperform standard LLMs but remain sensitive to retrieval quality, especially under counterfactual and diverse conditions, and that negative awareness is weak. To address these gaps, they propose Detect-and-Correct and a contrastive learning approach, which improve robustness on unlabeled and counterfactual data and enhance the model's ability to detect and avoid incorrect retrievals. The work highlights key limitations of current RALs in high-stakes biomedical settings and provides practical methods to improve reliability and safety in real-world deployment.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging the demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks a rigorous evaluation of the impact of RALs on different biomedical NLP tasks, which makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieved unlabeled, counterfactual, or diverse knowledge, which is common in the real world but not well studied in the biomedical domain. Finally, exploring the self-awareness ability is also crucial for a RAL system. In this paper, we therefore systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs in four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the performance of RALs on different biomedical NLP tasks and establish four testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.

Paper Structure

This paper contains 44 sections, 1 equation, 2 figures, 13 tables.

Introduction
Results
Results of RALs and backbone LLMs
Results of Testbed 1, 2 and 3
Results of Testbed 4: Negative Awareness
Results of our method on unlabeled database
Results of our method on counterfactual database
Results of our method on Awareness
Sampling bias
Error Analysis
Testbed 1
Testbed 2
Testbed 3
Discussion
Testbeds and our methods
...and 29 more sections

Figures (2)

Figure 1: BioRAB features queries over different types of corpora to test the awareness ability and generation ability of RALs.
Figure 2: Overview of the four testbeds in BioRAB. $n$ refers to the specific dataset for each task, such as ade-corpus-v2 (text classification) and PharmKG (link prediction). In (d), the corpus of $n$ refers to the set that includes the task datasets but excludes the training set of $n$. In (e), "Output" and "True/False" are distinguished as follows: "Output" is the expected output for the given task (for example, in the triple extraction task, the output is the triple), whereas "True/False" indicates whether or not the retrieved example is a negative example. In our work, the corpus of $n$ refers to the training set of $n$.
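As a rough illustration of the corpus definitions in the Figure 2 caption, the sketch below shows one way the retrieval corpora for the testbeds could be assembled. It is not the authors' implementation. The names `Example`, `unlabeled_corpus`, `counterfactual_corpus`, `diverse_corpus`, and `awareness_accuracy` are hypothetical, and two steps are assumptions rather than statements from the source: that the unlabeled corpus is obtained by dropping labels from the training set, and that the counterfactual corpus is obtained by mislabeling a chosen fraction of it. Only the diverse-corpus rule (all task datasets except the training set of $n$) and the True/False judgment come from the caption.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Example:
    """One retrievable item; label is None when the item is unlabeled."""
    text: str
    label: Optional[str]


def unlabeled_corpus(train: Sequence[Example]) -> List[Example]:
    # Assumption: "unlabeled" means the same texts with labels removed.
    return [Example(e.text, None) for e in train]


def counterfactual_corpus(train: Sequence[Example], label_set: Sequence[str],
                          rate: float, seed: int = 0) -> List[Example]:
    # Assumption: a fraction `rate` of items receives a wrong label.
    # Requires at least two distinct labels in `label_set`.
    rng = random.Random(seed)
    n_flip = round(rate * len(train))
    flip_idx = set(rng.sample(range(len(train)), n_flip))
    corpus = []
    for i, e in enumerate(train):
        if i in flip_idx:
            wrong = [lab for lab in label_set if lab != e.label]
            corpus.append(Example(e.text, rng.choice(wrong)))
        else:
            corpus.append(e)
    return corpus


def diverse_corpus(datasets: Dict[str, List[Example]], n: str) -> List[Example]:
    # From the caption: every task dataset except the training set of n.
    return [e for name, items in datasets.items() if name != n for e in items]


def awareness_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    # Fraction of True/False judgments ("is the retrieved example negative?")
    # that match the ground truth.
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

In this sketch, calling `counterfactual_corpus` with `rate=1.0` mislabels every item, which would make each retrieved example a negative one for the True/False judgment scored by `awareness_accuracy`.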
