R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

Lei Li, Xiao Zhou, Zheng Liu

TL;DR

Medical retrieval benchmarks have emphasized lexical similarity, leaving a gap for reasoning-driven tasks essential to clinical decision-making. R2MED introduces 876 queries across three reasoning-centric retrieval tasks and eight datasets spanning five clinical contexts and twelve body systems to evaluate and spur progress in reasoning-based medical retrieval. Evaluations of 15 systems reveal a major gap: the best vanilla retriever reaches 31.4 nDCG@10, while generation-augmented retrieval and large reasoning models push performance to around 41, illustrating both progress and remaining challenges. The dataset and code are released publicly to catalyze next-generation medical retrieval systems capable of leveraging explicit intermediate reasoning.

Abstract

Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient's symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark's difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at https://github.com/R2MED/R2MED
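
The scores quoted above are nDCG@10 values. As a point of reference, here is a minimal sketch of how that metric is commonly computed for a single query. It assumes linear gain with a log2 rank discount, which is one standard formulation; the function names and the toy document IDs are hypothetical and are not taken from the R2MED code. The reported figures (31.4, 41.4) appear to be on a 0-100 scale, so the value returned below would be multiplied by 100 and averaged over queries to be comparable.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the retriever returned them.
    qrels: relevance judgments for this query (doc ID -> gain); unlisted docs count as 0.
    """
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example (hypothetical IDs): two relevant documents, retrieved at ranks 2 and 4.
toy_qrels = {"d1": 1.0, "d2": 1.0}
toy_ranking = ["d3", "d1", "d4", "d2"]
toy_score = ndcg_at_k(toy_ranking, toy_qrels)
print(f"nDCG@10 = {toy_score:.4f}")
```

In the toy ranking, both relevant documents are retrieved but neither is at rank 1, so the score falls below 1.0. This is the kind of penalty that accumulates when a retriever matches surface symptoms rather than the inferred diagnosis.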

Paper Structure

This paper contains 34 sections, 16 figures, 30 tables.

Figures (16)

  • Figure 1: Overview of R2MED. Subfigure (1) presents a comparison between R2MED and the previous benchmark (NFCorpus), highlighting the shift from semantic matching to reasoning-driven retrieval. Subfigures 2(a) and 2(b) show the performance of retrieval and reasoning models on R2MED, underscoring the limitations of existing retrievers when faced with reasoning-driven benchmarks.
  • Figure 2: R2MED benchmark construction pipeline.
  • Figure 3: Attribute distributions of R2MED showcase its diversity and comprehensiveness.
  • Figure 4: Average reranking performance on R2MED using three classic rerankers: MonoBERT, BGE-Reranker, and RankLLaMA. Detailed scores are in Table \ref{tab:reranking_result}.
  • Figure 5: Correlation between reasoning answer accuracy and retrieval performance.
  • ...and 11 more figures