Closing the gap between open-source and commercial large language models for medical evidence summarization

Gongbo Zhang; Qiao Jin; Yiliang Zhou; Song Wang; Betina R. Idnay; Yiming Luo; Elizabeth Park; Jordan G. Nestor; Matthew E. Spotnitz; Ali Soroush; Thomas Campion; Zhiyong Lu; Chunhua Weng; Yifan Peng

TL;DR

This study addresses the transparency and performance gap between open-source and proprietary LLMs in medical evidence summarization. It applies LoRA-based fine-tuning to three open-source models (PRIMERA, LongT5, and Llama-2) on MedReview, a Cochrane-derived dataset, to improve domain-specific summarization. Automatic metrics, PICO-based evaluation, and human and GPT-4-simulated assessments all show that fine-tuning yields meaningful gains: fine-tuned LongT5 approaches zero-shot GPT-3.5-turbo performance, and smaller fine-tuned models sometimes outperform larger zero-shot counterparts. The results support using fine-tuned open-source LLMs for reliable, transparent medical evidence summarization and can guide model selection based on domain knowledge and resource constraints.

Abstract

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can improve their performance in summarizing medical evidence. Using a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even outperformed larger zero-shot models. These improvements were also observed in both human and GPT-4-simulated evaluations. Our results can guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
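To make the ROUGE-L figures in the abstract more concrete, the following is a minimal sketch of how a ROUGE-L F-measure can be computed from the longest common subsequence (LCS) of two token sequences. It is an illustration only: the whitespace tokenization, lowercasing, equal weighting of precision and recall, and the function names `lcs_length` and `rouge_l_f1` are assumptions, not the evaluation code used in the study, which may apply stemming, a different weighting, or a 0-100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the common subsequence ending before both items.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on a 0-1 scale (assumed: lowercase, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the intervention reduced mortality in adults"
    target = "the intervention probably reduced mortality"
    print(round(rouge_l_f1(generated, target), 3))
```

Because the score depends on token order as well as overlap, a summary that reuses the reference's wording in the same sequence scores higher than one that merely shares vocabulary.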

Paper Structure (13 sections, 7 figures, 6 tables)

This paper contains 13 sections, 7 figures, 6 tables.

Introduction
Results
Comparison of different LLMs in automatic evaluations
Comparison between zero-shot LongT5-xl and fine-tuned LongT5-base
Qualitative evaluation
Discussion
Methods
Data Collection
Fine-tuning LLMs
Evaluation Metrics
PICO metrics
Human evaluation
GPT-4 evaluation

Figures (7)

Figure 1: Overview of topic distribution of the MedReview dataset and LLMs in this study. a, Topic distribution of the MedReview dataset. b, Choice of LLMs in this study.
Figure 1: User interface for collecting human feedback. The upper box shows an example summary produced by a studied LLM. The lower box displays the multiple-choice question about the rationale for the human evaluators’ preference.
Figure 2: Performance of different medical evidence summarization systems in automatic evaluations. The p-value was calculated using a paired t-test to determine the statistical significance of the difference between the models. FT - fine-tuning; ZS - zero-shot learning; * - $p<0.05$; ** - $p<0.01$; *** - $p<0.001$; **** - $p<0.0001$; ns - Not significant.
Figure 2: Pearson correlation coefficients (r) among evaluation metrics. The natural language generation (NLG) metrics are strongly positively correlated with one another ($r>0.68$). The PICO metrics have a moderate positive correlation with the NLG metrics ($0.15<r<0.46$). Recall that NLG metrics focus on lexical similarity, while PICO metrics focus on coverage of key information (PICO elements in the summary).
Figure 3: Comparison between zero-shot LongT5-xl and fine-tuned LongT5-base. FT - fine-tuning; ZS - zero-shot learning; * - $p<0.05$; ** - $p<0.01$; *** - $p<0.001$; **** - $p<0.0001$; ns - Not significant.
...and 2 more figures
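The Pearson correlation coefficients reported among the evaluation metrics can be illustrated with a short sketch. The function name `pearson_r` and the per-summary score lists below are hypothetical, invented values for illustration; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-summary scores for two metrics (not from the paper).
    rouge_l_scores = [0.21, 0.35, 0.28, 0.40]
    meteor_scores = [0.18, 0.33, 0.30, 0.37]
    print(round(pearson_r(rouge_l_scores, meteor_scores), 3))
```

A value near 1 indicates that two metrics rank summaries similarly, which is the pattern described for the NLG metrics, while a lower value suggests that the metrics capture different aspects of summary quality.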
