Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li

Abstract

Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.

Paper Structure (44 sections, 5 equations, 19 figures, 12 tables)

Figures (19)

  • Figure 1: Overview of RWE-bench construction, execution, and evaluation. We curate a collection of peer-reviewed studies as benchmark tasks and require agents to reproduce the protocol-specified analyses on a real database. Agents submit answers in a hierarchical format to form an evidence bundle. We evaluate performance at both the question level and the task level, supplemented by an automatic verification of the generated cohorts.
  • Figure 2: Dataset statistics. (a) Protocol length (measured in word count) versus field count distribution. (b) Field type composition.
  • Figure 3: Success rates across task groups stratified by field count. The x-axis groups tasks by the number of fields, with labels indicating the corresponding intervals.
  • Figure 4: Screening results of RWEAgent cohorts. (a) Counts of qualified cohorts versus cohorts filtered out by rule-based checks or by the LLM judge. (b) Change in SR after cohort screening.
  • Figure 5: Search strategy. The search was performed on PubMed with a search date prior to August 19, 2025.
  • ...and 14 more figures