Accelerating Clinical Evidence Synthesis with Large Language Models

Zifeng Wang; Lang Cao; Benjamin Danek; Qiao Jin; Zhiyong Lu; Jimeng Sun

TL;DR

TrialMind, a generative artificial intelligence pipeline, streamlines study search, study screening, and data extraction in systematic reviews, showing the promise of human-AI collaboration for accelerating clinical evidence synthesis.

Abstract

Synthesizing clinical evidence largely relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications makes it difficult to identify, summarize, and update clinical evidence efficiently. Here, we introduce TrialMind, a generative artificial intelligence (AI) pipeline for facilitating human-AI collaboration in three crucial tasks for evidence synthesis: study search, screening, and data extraction. To assess its performance, we built a benchmark dataset from published systematic reviews, named TrialReviewBench, which contains 100 systematic reviews and the associated 2,220 clinical studies. Our results show that TrialMind excels across all three tasks. In study search, it generates diverse and comprehensive search queries to achieve high recall rates (TrialMind 0.711-0.834 vs. human baseline 0.138-0.232). For study screening, TrialMind surpasses traditional embedding-based methods by 30% to 160%. In data extraction, it outperforms a GPT-4 baseline by 29.6% to 61.5%. We further conducted user studies to confirm its practical utility. Compared to manual efforts, human-AI collaboration using TrialMind yielded a 71.4% recall lift and 44.2% time savings in study screening, and a 23.5% accuracy lift and 63.4% time savings in data extraction. Additionally, when comparing synthesized clinical evidence presented in forest plots, medical experts favored TrialMind's outputs over GPT-4's outputs in 62.5% to 100% of cases. These findings show the promise of LLM-based approaches like TrialMind to accelerate clinical evidence synthesis by streamlining study search, screening, and data extraction from medical literature, with exceptional performance improvement when working with human experts.

Paper Structure (16 sections, 7 equations, 16 figures)

Database search and initial filtering
Refinement
Manual screening of titles and abstracts
In-context learning
Retrieval-augmented generation
Chain-of-thought
LLM-driven pipeline
LLMs
Research question inputs
Literature search
Study screening
Data extraction
Result extraction
Literature search and screening
Data extraction and result extraction
...and 1 more section

Figures (16)

Figure 1: Overview of the TrialMind pipeline. a, The pipeline has four main steps: literature search, literature screening, data extraction, and evidence synthesis. b, (1) Using the input PICO elements, TrialMind generates key terms to construct Boolean queries for retrieving studies from literature databases. (2) TrialMind formulates eligibility criteria, which users can edit to provide context for LLMs during eligibility predictions. Users can then select studies based on these predictions and rank their relevance by aggregating them. (3) TrialMind processes the descriptions of target data fields to extract and output the required information as structured data. (4) TrialMind extracts findings from the studies and collaborates with users to synthesize the clinical evidence.
Figure 1: The study design compares the synthesized clinical evidence from the baseline and TrialMind via human evaluation.
Figure 2: Literature search experiment results. a, The total number of involved studies and the number of review papers across different topics. b, The TrialMind interface for users to retrieve studies. c, The Recall of the search results for reviews across four topics. The bar heights indicate the Recall, and the star indicates the number of studies found. d, Scatter plots of the Recall against the number of ground-truth studies. Each point indicates the results of one review. Regression estimates are displayed with the 95% CIs in blue or purple. e, Example cases comparing the outputs of three methods.
Figure 2: The flowchart of the screening process of meta-analyses involved in the TrialReviewBench dataset.
Figure 3: Literature screening experiment results. a, Streamlined study screening using TrialMind with a human in the loop. b, Ranking performance for Recall@20/50 across therapeutic areas. c, Recall@20 and Recall@50 for TrialMind and selected baselines. d, Effect of each individual criterion on the ranking results. e, Ranking performance for $\text{Recall}@K$ with varying $K$ in four topics. Shaded areas are $95\%$ confidence intervals.
...and 11 more figures
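
The search and screening steps described in the Figure 1 caption, and the Recall@K metric reported in Figure 3, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the OR-within/AND-across query layout, the 1/0/-1 encoding of per-criterion eligibility predictions, and the use of a simple mean to aggregate them are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def build_boolean_query(term_groups: Dict[str, List[str]]) -> str:
    """Join key terms into a Boolean query (hypothetical layout):
    OR the terms within each PICO element, then AND the elements together."""
    clauses = []
    for terms in term_groups.values():
        if not terms:
            continue  # skip PICO elements that have no key terms
        quoted = [f'"{term}"' for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


def aggregate_eligibility(predictions: Sequence[int]) -> float:
    """Average per-criterion predictions into one relevance score.
    Assumed encoding: 1 = eligible, 0 = uncertain, -1 = not eligible."""
    if not predictions:
        return 0.0
    return sum(predictions) / len(predictions)


def rank_studies(per_study: Dict[str, Sequence[int]]) -> List[str]:
    """Return study IDs sorted by aggregated score, highest first."""
    return sorted(
        per_study,
        key=lambda study_id: aggregate_eligibility(per_study[study_id]),
        reverse=True,
    )


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of ground-truth studies that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for study_id in ranked[:k] if study_id in relevant)
    return hits / len(relevant)


# Illustrative inputs only; these are not data from TrialReviewBench.
query = build_boolean_query(
    {
        "population": ["lymphoma"],
        "intervention": ["CAR-T", "chimeric antigen receptor"],
    }
)
ranked = rank_studies({"s1": [1, -1, 0], "s2": [1, 1, 1], "s3": [-1, -1, 0]})
recall = recall_at_k(ranked, {"s2", "s3"}, k=2)
```

In this sketch a user-edited criterion list would simply change the length of each prediction vector, so dropping or adding a criterion changes the ranking without touching the ranking code.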
