Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Hyunjae Kim; Jiwoong Sohn; Aidan Gilson; Nicholas Cochran-Caggiano; Serina Applebaum; Heeju Jin; Seihee Park; Yujin Park; Jiyeong Park; Seoyoung Choi; Brittany Alexandra Herrera Contreras; Thomas Huang; Jaehoon Yun; Ethan F. Wei; Roy Jiang; Leah Colucci; Eric Lai; Amisha Dave; Tuo Guo; Maxwell B. Singer; Yonghoe Koo; Ron A. Adelman; James Zou; Andrew Taylor; Arman Cohan; Hua Xu; Qingyu Chen

TL;DR

This work presents a large-scale, expert-driven, stage-wise evaluation of retrieval-augmented generation (RAG) in medicine, which the authors describe as the most comprehensive to date. By separately assessing retrieval, evidence selection, and response generation, the authors show that standard RAG often degrades factuality and completeness, a drop driven by poor retrieval quality and weak evidence integration, with significant variability across tasks and models. The study introduces two practical mitigations, evidence filtering and query reformulation, that yield task-dependent gains, especially on challenging medical QA benchmarks. The findings argue for stage-aware evaluation and selective RAG deployment, and the study supplies an expert-annotated dataset to support future development and benchmarking of medical RAG systems.

Abstract

Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
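The stage-wise metrics named in the abstract (share of relevant passages among the retrieved set, and precision and recall of evidence selection) can be illustrated with a minimal sketch. The `AnnotatedQuery` container, its field names, and the exact metric definitions below are assumptions made for illustration; the abstract does not specify the authors' annotation schema or formulas, so this shows only the standard set-based definitions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AnnotatedQuery:
    """Hypothetical record of expert annotations for one query (not the authors' schema)."""
    retrieved: list[str]  # IDs of the top-k retrieved passages
    relevant: set[str]    # IDs that experts judged relevant
    used: set[str]        # IDs the model's response actually drew on


def retrieval_relevance(q: AnnotatedQuery) -> float:
    """Stage (i): fraction of retrieved passages judged relevant."""
    if not q.retrieved:
        return 0.0
    return sum(p in q.relevant for p in q.retrieved) / len(q.retrieved)


def selection_precision_recall(q: AnnotatedQuery) -> tuple[float, float]:
    """Stage (ii): how well the model picked relevant evidence.

    Precision: share of used passages that are relevant.
    Recall: share of relevant retrieved passages that were used.
    """
    relevant_retrieved = q.relevant & set(q.retrieved)
    hits = q.used & relevant_retrieved
    precision = len(hits) / len(q.used) if q.used else 0.0
    recall = len(hits) / len(relevant_retrieved) if relevant_retrieved else 0.0
    return precision, recall


# Toy example: 4 passages retrieved, 2 relevant, model used one relevant and one irrelevant.
example = AnnotatedQuery(
    retrieved=["p1", "p2", "p3", "p4"],
    relevant={"p1", "p3"},
    used={"p1", "p2"},
)

print(retrieval_relevance(example))         # 0.5
print(selection_precision_recall(example))  # (0.5, 0.5)
```

Keeping the two stages as separate functions mirrors the paper's argument that retrieval and evidence selection should be measured independently, since a good final answer can mask a failure at either stage.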
