Comparative Analysis of Open-Source Language Models in Summarizing Medical Text Data

Yuhao Chen, Zhimu Wang, Bo Wen, Farhana Zulkernine

TL;DR

This paper addresses the need for systematic evaluation of open-source LLMs on domain-specific medical text summarization. It introduces a GPT-4-based assessor and an adversarial, pairwise evaluation framework that reduces the number of comparisons to $n-1$ per data sample while correcting for order bias. Using Llama2-70B-chat-hf and Mistral-7B-Instruct-v0.1 across three summarization tasks drawn from the MEDIQA and iCliniq datasets, the study finds that Llama2-70B generally outperforms the other open-source LLMs evaluated. The approach demonstrates a scalable method for model selection and quality control in digital health, enabling more reliable biomedical text summarization.
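One way to picture how $n-1$ pairwise comparisons with order-bias correction could work is a sequential knockout: the current best summary is compared against each remaining candidate, and every pairing is judged twice with the presentation order swapped. The sketch below is only an illustration under those assumptions and is not taken from the paper; `judge`, `debiased_winner`, and `knockout` are hypothetical names, and the toy judge stands in for a GPT-4 call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: given two summaries in display order,
# return "first" or "second" for the one it prefers.
Judge = Callable[[str, str], str]


def debiased_winner(incumbent: str, challenger: str, judge: Judge) -> str:
    """Judge the pair in both display orders (assumed order-bias correction).

    The challenger replaces the incumbent only if it wins in both orders;
    an inconsistent verdict is treated as a tie and the incumbent stays.
    """
    incumbent_wins = 0
    if judge(incumbent, challenger) == "first":   # incumbent shown first
        incumbent_wins += 1
    if judge(challenger, incumbent) == "second":  # incumbent shown second
        incumbent_wins += 1
    return challenger if incumbent_wins == 0 else incumbent


def knockout(summaries: Sequence[str], judge: Judge) -> Tuple[str, int]:
    """Return the surviving summary and the number of pairings used (n - 1)."""
    champion = summaries[0]
    pairings = 0
    for challenger in summaries[1:]:
        champion = debiased_winner(champion, challenger, judge)
        pairings += 1
    return champion, pairings


def toy_judge(first: str, second: str) -> str:
    """Toy stand-in for an LLM assessor: prefers the longer summary."""
    return "first" if len(first) >= len(second) else "second"


candidates = ["short", "a much longer summary", "medium one"]
best, used = knockout(candidates, toy_judge)
print(best, used)
```

With $n$ candidate summaries, this design uses $n-1$ pairings per data sample, compared with $n(n-1)/2$ for a full round-robin. Each pairing here costs two judge calls because of the order swap.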

Abstract

Unstructured text in medical notes and dialogues contains rich information. Recent advancements in Large Language Models (LLMs) have demonstrated superior performance in question answering and summarization tasks on unstructured text data, outperforming traditional text analysis approaches. However, there is a lack of scientific studies in the literature that methodically evaluate and report on the performance of different LLMs, specifically for domain-specific data such as medical chart notes. We propose an evaluation approach to analyze the performance of open-source LLMs such as Llama2 and Mistral for medical summarization tasks, using GPT-4 as an assessor. Our innovative approach to quantitative evaluation of LLMs can enable quality control, support the selection of effective LLMs for specific tasks, and advance knowledge discovery in digital health.

Paper Structure (12 sections, 3 equations, 1 figure, 1 table)