Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Hyunjae Kim, Jiwoong Sohn, Aidan Gilson, Nicholas Cochran-Caggiano, Serina Applebaum, Heeju Jin, Seihee Park, Yujin Park, Jiyeong Park, Seoyoung Choi, Brittany Alexandra Herrera Contreras, Thomas Huang, Jaehoon Yun, Ethan F. Wei, Roy Jiang, Leah Colucci, Eric Lai, Amisha Dave, Tuo Guo, Maxwell B. Singer, Yonghoe Koo, Ron A. Adelman, James Zou, Andrew Taylor, Arman Cohan, Hua Xu, Qingyu Chen

TL;DR

This work provides the first large-scale, expert-driven, stage-wise evaluation of retrieval-augmented generation in medicine. By dissecting retrieval, evidence usage, and generation, the authors reveal that standard RAG often degrades factuality and completeness, driven by poor retrieval quality and weak evidence integration, with significant variability across tasks and models. The study introduces practical mitigations—evidence filtering and query reformulation—that yield robust, task-dependent gains, especially on challenging medical QA benchmarks. The findings advocate for stage-aware evaluation and selective RAG deployment, and they supply a rich, expert-annotated dataset to support future development and benchmarking of medical RAG systems.

Abstract

Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
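The evidence-selection precision and recall quoted above can be read as a set comparison between the passages a response cites and the passages experts judged relevant. The following is a minimal sketch under that assumption; the function name, the passage identifiers, and the per-query formulation are illustrative and are not taken from the paper, which may aggregate scores differently.

```python
def selection_precision_recall(cited, relevant):
    """Per-query precision and recall of evidence selection.

    cited:    ids of retrieved passages that the model response cites (assumed input)
    relevant: ids of retrieved passages that annotators marked as relevant (assumed input)
    """
    cited, relevant = set(cited), set(relevant)
    true_positives = len(cited & relevant)
    # Precision: share of cited passages that are actually relevant.
    precision = true_positives / len(cited) if cited else 0.0
    # Recall: share of relevant passages that the response actually cites.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: three cited passages, two relevant ones, one overlap.
precision, recall = selection_precision_recall({"p1", "p2", "p5"}, {"p1", "p3"})
```

In this hypothetical case one of the three cited passages is relevant and one of the two relevant passages is cited, so low precision and low recall can occur together, as in the figures reported in the abstract.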

Paper Structure

This paper contains 40 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Study design and evaluation framework. Our fine-grained framework decomposes the RAG pipeline into three components (evidence retrieval, evidence selection, and response generation), enabling systematic evaluation of each stage. First, retrieved passages are annotated for relevance (evidence retrieval; $n = 30{,}800$). Next, model responses are evaluated to determine whether they are grounded in relevant evidence (evidence selection; $n = 26{,}032$). Finally, responses are broken down into individual statements and assessed for correctness (factuality; $n = 15{,}970$) and coverage of required information (completeness; $n = 7{,}700$). In total, our framework comprises 80,502 expert annotations, enabling rigorous error attribution across the full RAG pipeline.
  • Figure 2: Evidence retrieval performance across different evaluation metrics and query types. a, Precision@k: proportion of relevant passages among the top-k; higher is better. b, Miss@k: proportion of queries with no relevant passage in the top-k; lower is better. c, Coverage@k: proportion of must-have statements supported by the top-k; higher is better.
  • Figure 3: Analysis of citation types and evidence selection performance. a, Average number of references per query, categorized by evidence source (retrieval-based vs. self-generated); retrieval-based references are further broken down by relevance. Self-generated references refer to citations produced by the RAG model itself that do not appear among the retrieved passages. b, Precision and recall for identifying relevant evidence among retrieved passages.
  • Figure 4: Factuality and completeness of model responses. a, Average factuality scores at the response and statement levels across queries. b, Statement-level factuality broken down by the type of evidence cited: relevant retrieved (True Positive), irrelevant retrieved (False Positive), self-generated, or none. c, Average completeness scores at the response and statement levels, based on coverage of must-have statements. d, Completeness for must-have statements, categorized by whether the supporting evidence was retrieved and cited (Supported & Referenced), retrieved but not cited (Supported but Missed), or not retrieved at all (Unsupported).
  • Figure 5: Performance of RAG variants and non-RAG baselines across five QA datasets: MedQA (jin2021disease), MMLU (hendrycks2021measuring), MMLU-Pro (wang2024mmlu), MedMCQA (pal2022medmcqa), and MedXpertQA (zuo2025medxpertqa). a, Results using Llama-3.1-8B as the base model. b, Results using GPT-4o as the base model. Four RAG configurations are compared: (i) standard RAG (retrieval-only), (ii) retrieval + evidence filtering, (iii) retrieval + query reformulation, and (iv) retrieval + both evidence filtering and query reformulation, evaluated at different top-k settings ($k = 1, 2, 4, 8, 16, 32$). Each cell displays the accuracy of the model, and its color indicates the magnitude and direction of the accuracy change relative to the base LLM: sky-blue denotes gains, while red denotes drops. Base LLM accuracies are shown beneath each block for reference. The best scores are highlighted in bold.
  • ...and 1 more figures
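
The three retrieval metrics defined in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes per-passage relevance labels and per-passage sets of supported must-have statements are available as plain Python lists; the function names and input layout are hypothetical, and dividing Precision@k by k (rather than by the number of passages actually returned) is one common convention that the paper may or may not use.

```python
from typing import Sequence, Set


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Proportion of relevant passages among the top-k (assumes k > 0)."""
    return sum(relevance[:k]) / k


def miss_at_k(per_query_relevance: Sequence[Sequence[bool]], k: int) -> float:
    """Proportion of queries with no relevant passage in the top-k."""
    misses = sum(1 for rel in per_query_relevance if not any(rel[:k]))
    return misses / len(per_query_relevance)


def coverage_at_k(supported: Sequence[Set[int]], num_must_have: int, k: int) -> float:
    """Proportion of must-have statements supported by at least one top-k passage.

    supported[i] holds the indices of must-have statements that passage i
    supports (an assumed annotation format).
    """
    covered = set().union(*supported[:k])
    return len(covered) / num_must_have


# Hypothetical ranked relevance labels for one query and for a pair of queries.
ranked = [True, False, False, True]
queries = [[False, False, True], [True, False, False]]
# Hypothetical support sets: passage 0 supports statement 0, passage 2 supports 1 and 2.
support = [{0}, set(), {1, 2}]

p_at_2 = precision_at_k(ranked, 2)
m_at_2 = miss_at_k(queries, 2)
c_at_2 = coverage_at_k(support, 4, 2)
c_at_3 = coverage_at_k(support, 4, 3)
```

Precision@k and Coverage@k are computed per query and would then be averaged across queries, whereas Miss@k is already a proportion over the query set.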