A foundation model for human-AI collaboration in medical literature mining

Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, Chang-In Choi, Mehdi Emamverdi, Manjot K. Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Hanley Ong, Justin Rousseau, Irfan Sheikh, Jenny J. Wei, Ziyang Xu, Christopher M. Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun

TL;DR

LEADS, an AI foundation model for study search, screening, and data extraction from medical literature, demonstrates consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks and enhances expert workflows by providing supportive references following expert requests, streamlining processes while maintaining high-quality results.

Abstract

Systematic literature review is essential for evidence-based medicine, requiring comprehensive analysis of clinical trial publications. However, the application of artificial intelligence (AI) models for medical literature mining has been limited by insufficient training and evaluation across broad therapeutic areas and diverse tasks. Here, we present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature. The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. We show that LEADS delivers consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks. Furthermore, LEADS enhances expert workflows by providing supportive references following expert requests, streamlining processes while maintaining high-quality results. A study with 16 clinicians and medical researchers from 14 different institutions revealed that, in study selection, experts collaborating with LEADS achieved a recall of 0.81 compared to 0.77 for experts working alone, with a time savings of 22.6%. In data extraction tasks, experts using LEADS achieved an accuracy of 0.85 versus 0.80 without LEADS, alongside a 26.9% time savings. These findings highlight the potential of specialized medical literature foundation models to outperform generic models, delivering significant quality and efficiency benefits when integrated into expert workflows for medical literature mining.

Paper Structure

This paper contains 18 sections, 3 equations, 23 figures.

Figures (23)

  • Figure 1: Overview of LEADS and LEADSInstruct. a, LEADSInstruct consists of 20K+ systematic reviews, 453K+ publications, and 27K+ clinical trials linked across data sources. A hybrid approach is adopted to transform the linked data into instruction data covering six tasks in literature mining. b, Bar plot showing the number of reviews covering different conditions. c, Bar plot showing the number of reviews covering different interventions. d, Comparative performance analysis contrasting LEADS with cutting-edge proprietary and open-source AI models. The evaluation metrics are Recall for search query generation, Recall@50 for study eligibility assessment, and Accuracy for the remaining tasks. e, Density plot of the number of tokens in the inputs and outputs of the instruction datasets. f, Illustration of the experimental setups. g, Illustration of the user study setup.
  • Figure 1: The forms shared with experts to complete the pilot user study for study screening. a, The Expert-only arm, where the participant finds eligible studies in a randomly shuffled list of candidates and submits the results along with the time spent. b, The Expert+AI arm, where the participant finds eligible studies with reference to the AI eligibility assessment results. c, The AI assessment results that participants read when making their decisions. The studies are ranked by overall score, with the predictions and rationale broken down by PICO element.
  • Figure 2: LEADS performs literature search tasks. a, Illustration of how LEADS receives the research question definition, performs search query generation, and retrieves citations from the literature. b, Distribution of the condition topics of the reviews and involved citations in the dataset. c, Search query generation performance of LEADS and the leading models, in terms of the Recall achieved by the identified studies. The information in parentheses indicates the performance change of baselines compared to LEADS, or of LEADS compared to the best baseline in the same task. d, Topic-wise comparison of LEADS to GPT-4o in terms of the Recall yielded by the generated search query. LEADS + ensemble indicates an ensemble of multiple search queries. e, Performance of LEADS and GPT-4o as the number of target studies per review varies. Error bars indicate 95% confidence intervals and are omitted when the sample size is smaller than ten.
  • Figure 2: The forms shared with experts to complete the pilot user study for data extraction. a, The Expert-only arm, where the participant follows the definition of the target field and extracts the results from the raw study document. b, The Expert+AI arm, where the participant can refer to AI extraction results when extracting the target field values from the raw study document.
  • Figure 3: LEADS performs citation screening tasks. a, Radar plot of Recall@50, comparing LEADS to cutting-edge LLMs and dense retrieval across various review condition topics. b, Recall performance of LEADS compared to other LLMs and dense retrieval. The information in parentheses indicates the performance change of baselines compared to LEADS. c, Performance of LEADS and baselines as the number of target studies per review varies. d, Illustration of how LEADS receives the study inclusion and exclusion criteria defined for target PICO elements, makes eligibility predictions, and ranks the target studies.
  • ...and 18 more figures
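
The figure captions above report Recall@50 for study eligibility assessment, where candidate studies are ranked and the top of the list is compared against the studies a review actually included. The excerpt does not spell out the paper's exact formula, so the following is only a minimal sketch that assumes the standard definition of Recall@K (the fraction of relevant items found in the top K of a ranking). The function name `recall_at_k` and the study identifiers are hypothetical and used purely for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 50) -> float:
    """Fraction of relevant studies that appear in the top-k of a ranked list.

    Assumes the standard Recall@K definition; the paper's exact variant
    is not given in this excerpt.
    """
    if not relevant_ids:
        raise ValueError("relevant_ids must be non-empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Toy data (made up): a ranking of five candidates, three truly eligible studies.
ranked = ["s3", "s1", "s7", "s2", "s9"]
relevant = {"s1", "s2", "s4"}

print(recall_at_k(ranked, relevant, k=3))  # only s1 is in the top 3
print(recall_at_k(ranked, relevant, k=5))  # s1 and s2 are in the top 5
```

With the default `k=50`, the same function would score a ranking of a review's full candidate pool; a study that never appears in the candidate list (like `s4` here) caps the achievable recall regardless of ranking quality.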