Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models

Minbyul Jeong; Jiwoong Sohn; Mujeen Sung; Jaewoo Kang

TL;DR

This work addresses the poor domain generalization of retrieval-augmented generation on biomedical reasoning by introducing Self-BioRAG, a domain-specific framework that retrieves on demand and self-reflects while generating explanations. It combines a biomedical retriever (MedCPT), a critic language model that predicts reflective tokens, and a domain-tuned generator that produces answers grounded in retrieved evidence and encoded knowledge. The training data starts from 120k biomedical instruction sets, filtered to 84k, and yields two specialized LMs: a critic and a generator. Empirically, Self-BioRAG achieves a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with 7B parameters or fewer on three major medical question-answering benchmarks, and the analyses highlight the role of domain-specific data, retrieval strategies, and reflective control. The authors release data, code, and model weights (7B and 13B) to support further work in biomedical and clinical NLP.

Abstract

Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generation. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed; they search for documents in a knowledge corpus and append them, unconditionally or selectively, to the input of LLMs for generation. However, when existing methods are applied to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a reliable framework for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG, which can assess its generated explanations with customized reflective tokens. Our work shows that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. On three major medical question-answering benchmark datasets, Self-BioRAG demonstrates significant performance gains, achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, our analysis shows that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge, as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
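
The on-demand retrieval flow described in the abstract (decide whether to retrieve, fetch evidence only if needed, then answer from the best evidence plus encoded knowledge) can be illustrated with a minimal sketch. Every name here (`answer_query`, `toy_needs_retrieval`, `toy_retrieve`, `toy_generate`, `TOY_CORPUS`) is hypothetical and is not the released Self-BioRAG API; the word-overlap score is only a stand-in for whatever reflective-token-based scoring the actual model uses.

```python
from typing import Callable, List, Tuple

# Hypothetical two-passage corpus; the real system searches biomedical corpora.
TOY_CORPUS = [
    "unrelated passage about anatomy",
    "aspirin dose for adults is discussed here",
]


def answer_query(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, str], Tuple[str, float]],
    k: int = 10,
) -> str:
    """Answer directly, or retrieve up to k passages and keep the best-scored answer."""
    if not needs_retrieval(query):
        # [No Retrieval] branch: rely on encoded knowledge only.
        answer, _ = generate(query, "")
        return answer
    passages = retrieve(query, k)
    if not passages:
        # Nothing retrieved: fall back to the direct answer (an assumption of this sketch).
        answer, _ = generate(query, "")
        return answer
    # [Retrieval] branch: one candidate answer per passage, keep the highest score.
    candidates = [generate(query, passage) for passage in passages]
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer


def toy_needs_retrieval(query: str) -> bool:
    """Stand-in for the generator's retrieval decision."""
    return "dose" in query.lower()


def toy_retrieve(query: str, k: int) -> List[str]:
    """Stand-in for a domain-specific retriever such as MedCPT."""
    return TOY_CORPUS[:k]


def toy_generate(query: str, passage: str) -> Tuple[str, float]:
    """Stand-in generator: returns an answer and a crude word-overlap score."""
    if not passage:
        return "direct answer", 0.0
    overlap = len(set(query.lower().split()) & set(passage.lower().split()))
    return f"grounded in: {passage}", float(overlap)


direct = answer_query("Define anatomy", toy_needs_retrieval, toy_retrieve, toy_generate)
grounded = answer_query(
    "What is the usual aspirin dose?", toy_needs_retrieval, toy_retrieve, toy_generate
)
```

In this toy setup the relevant passage is deliberately placed second in the corpus, so the grounded answer only comes out right because the best candidate is selected among the retrieved passages rather than taking the first one.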

Paper Structure (39 sections, 7 equations, 7 figures, 16 tables)

Introduction
Background
Proprietary & Open Language Models
Learning with Reward Strategy
Retrieval-Augmented Generation
Self-BioRAG
Biomedical Instruction Datasets
List of Instruction Datasets for Biomedical and Clinical Domains.
Biomedical Retriever
Self-Reflection Language Model (Critic Language Model)
Data Construction of Critic LM $C$.
Process of Training Critic LM $C$.
Annotating Biomedical Instruction Sets Using Critic LM $C$.
Domain-Specific Instruction-Tuned Language Model (Generator Language Model)
Data Construction Using Critic LM $C$ & Training Generator LM $M$.
...and 24 more sections

Figures (7)

Figure 1: Comparison of three frameworks: generation using a language model (LM), retrieval-augmented generation (RAG) using an LM, and our Self-BioRAG. (A) Sequence-to-sequence generation with an LM. (B) The RAG framework first finds relevant documents in a large-scale corpus such as PubMed Central and then answers based on this factual content, compensating for knowledge the LM lacks. (C) Our domain-specific instruction-tuned model first predicts whether retrieval is necessary. If a query does not require retrieved knowledge (factual content), the model predicts the answer directly. If the query does require it, Self-BioRAG uses the domain-specific retriever (MedCPT, in our case) to retrieve the top-$k$ relevant documents. The language model then selects the most pertinent evidence for the query and generates the answer based on the selected evidence and its encoded knowledge.
Figure 2: Overview of the Self-BioRAG process: data construction, training, and inference of the Self-Reflection Language Model (critic LM $C$) and the Domain-specific Instruction-tuned Language Model (generator LM $M$). We construct 120k biomedical instruction sets from two off-the-shelf instruction sets, Mol-Instructions (Fang et al., 2023) and MedInstruct (Zhang et al., 2023), and one self-generated biomedical instruction set. We first sample 5k instructions, generate reflective tokens for them via GPT-4 API calls, and train the critic LM $C$ on these instructions. Using the trained critic LM $C$, we filter out mispredicted reflective tokens, such as [Continue Generation]. We preserve 84k instruction sets annotated with pre-defined reflective tokens to train the generator LM $M$. Note that the critic LM $C$ is used only to annotate the reflective tokens by which the instruction sets for training the generator LM $M$ are filtered. After training, the model $M$ can predict whether to use retrieval and can combine evidence with encoded knowledge to answer the question. We use a MedQA (Jin et al., 2021) test sample to illustrate how Self-BioRAG works.
Figure 3: Ratio of retrieved evidence drawn from each of the four biomedical corpora (PubMed, PMC, CPG, Medical Textbook). The RAG statistics refer to the top-1 evidence usage ratio, while Self-BioRAG selects the most useful evidence from the top-10 retrieved evidence.
Figure 4: Performance of LLaMA2, RAG, and Self-BioRAG on MedQA test examples, split into [No Retrieval] and [Retrieval] groups according to Self-BioRAG's prediction.
Figure 5: Nested pie chart of the root verbs of our biomedical instructions (inner circle) and their four noun objects (outer circle). It illustrates the diversity of the generated instructions and their relation to biomedical terms such as hypothesis, proteins, diagnosis, and symptoms.
...and 2 more figures
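
The filtering step described in the Figure 2 caption (annotate instructions with the critic LM $C$, then drop those whose predicted reflective token is not a pre-defined one) can be sketched as follows. This is a minimal illustration under stated assumptions: `filter_instructions` and `toy_critic` are hypothetical names, the critic is a keyword stub rather than a trained model, and `ALLOWED_TOKENS` lists only the two tokens named on this page, not the full pre-defined token set.

```python
from typing import Callable, Dict, List, Set

# Assumption: only the two tokens named on this page are listed here;
# the complete pre-defined reflective-token set is not given on this page.
ALLOWED_TOKENS: Set[str] = {"[Retrieval]", "[No Retrieval]"}


def filter_instructions(
    instructions: List[str],
    critic: Callable[[str], str],
    allowed: Set[str] = ALLOWED_TOKENS,
) -> List[Dict[str, str]]:
    """Annotate each instruction with the critic and keep only allowed tokens."""
    kept = []
    for text in instructions:
        token = critic(text)
        if token in allowed:
            kept.append({"instruction": text, "token": token})
    return kept


def toy_critic(text: str) -> str:
    """Hypothetical keyword stub standing in for the trained critic LM C."""
    lowered = text.lower()
    if "latest" in lowered:
        return "[Retrieval]"
    if "summarize" in lowered:
        # Treated as a mispredicted token, so the instruction is filtered out.
        return "[Continue Generation]"
    return "[No Retrieval]"


pool = [
    "List the latest guidelines for hypertension",
    "Summarize this discharge note",
    "Define tachycardia",
]
kept = filter_instructions(pool, toy_critic)
retained_fraction = 84_000 / 120_000  # figures from the caption: 84k kept of 120k
```

With the numbers stated in the caption, the filter retains 84k of the 120k instruction sets, that is, 70% of the pool.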
