Can Large Language Models Match the Conclusions of Systematic Reviews?

Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy

TL;DR

The paper introduces MedEvidence, a benchmark designed to test whether large language models can replicate the conclusions of expert-conducted medical systematic reviews when given the same underlying studies. Through a large-scale evaluation of 24 LLMs across zero-shot and expert-guided prompts, it demonstrates that reasoning capabilities, model size, and medical fine-tuning do not reliably improve performance, and that models exhibit overconfidence and a lack of skepticism toward lower-quality evidence. The authors detail a rigorous four-stage dataset curation pipeline using Cochrane SRs and PubMed sources, and provide extensive analysis on factors like evidence certainty and source concordance that influence model behavior. The findings suggest that current LLMs cannot yet match expert SR conclusions consistently, underscoring the need for further methodological advances and transparent benchmarking to safely deploy AI-assisted evidence synthesis in clinical settings. MedEvidence code and data are released to spur progress in this area.

Abstract

Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone of evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical specialist models across a range of sizes (7B to 700B parameters). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.

Paper Structure

This paper contains 31 sections, 47 figures, 3 tables.

Figures (47)

  • Figure 1: Core skills evaluated by MedEvidence including: medical domain expertise across 10 different specialties, synthesizing conflicting evidence, and applying scientific skepticism when studies exhibit a high risk of bias (e.g. due to small sample sizes or insufficient supporting evidence).
  • Figure 2: Overview of the dataset curation process for MedEvidence.
  • Figure 3: Key statistical characteristics of the questions in MedEvidence. (a) shows the dataset distribution stratified by medical specialty. (b) presents the distribution stratified by outcome effect. (c) shows the distribution stratified by source concordance with the expert-assessed treatment outcome effect (i.e. the correct answer).
  • Figure 4: (a) Average model accuracy (and 95% CI) on MedEvidence, overlaid on the percentage of questions where the model provided valid output (additional details in Appendix Section \ref{a:sec:instr-follow}). (b) Average recall by ground-truth treatment outcome effect, aggregated across all models (with overall 95% interval). Per-model average recall by treatment outcome effect can be found in Appendix Figure \ref{fig:per-class-recall}.
  • Figure 5: (a) Accuracy as a function of evidence certainty shows a monotonically increasing trend. (b) Accuracy as a function of source concordance, defined as the percentage of relevant sources that agree with the final systematic review (SR) answer, also exhibits a monotonically increasing trend.
  • ...and 42 more figures