Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari

TL;DR

The paper addresses the burden of clinical documentation and demonstrates that adapted large language models (LLMs), including GPT-4 with in-context learning, can match or surpass medical experts in clinical text summarization across radiology, patient questions, progress notes, and doctor–patient dialogue. The authors evaluate eight models using two adaptation methods (in-context learning and QLoRA) on six open datasets spanning four tasks, supplemented by a clinical reader study with ten physicians and a safety analysis. Results show that the best adapted LLMs are often equivalent or superior to expert summaries in completeness, correctness, and conciseness, with manageable safety risks and fewer fabricated details than experts in many cases. The work highlights the potential for integrating LLM-generated summaries into clinical workflows to reduce documentation load, while acknowledging limitations and the need for prospective clinical validation and governance considerations.

Abstract

Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP), their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Quantitative assessments with syntactic, semantic, and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.

Paper Structure (32 sections, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Framework overview. First, we quantitatively evaluate each valid combination ($\times$) of LLM and adaptation method across four distinct summarization tasks comprising six datasets. We then conduct a clinical reader study in which ten physicians compare summaries of the best model/method against those of a medical expert. Lastly, we perform a safety analysis to quantify potential medical harm and to categorize types of fabricated information.
  • Figure 2: Left: Prompt anatomy. Each summarization task uses a slightly different instruction (Table \ref{tab:datasets}). Right: Effect of model temperature and expertise. We generally find better performance when (1) using a lower temperature, i.e. generating less random output, as summarization tasks benefit more from truthfulness than creativity, and (2) assigning the model clinical expertise in the prompt. Output generated via GPT-3.5 on the Open-i radiology report dataset.
  • Figure 3: Alpaca vs. Med-Alpaca. Given that most data points are below the dashed lines denoting equivalence, we conclude that Med-Alpaca's fine-tuning with medical Q&A data results in worse performance for our clinical summarization tasks. See Section \ref{sec:results_quant_eval} for further discussion. Note that each data point corresponds to the average score of $s=250$ samples for a given experimental configuration, i.e. {dataset $\times$ $m$ in-context examples}.
  • Figure 4: One in-context example (ICL) vs. QLoRA across open-source models on Open-i radiology reports. FLAN-T5 achieves the best performance with both methods on this dataset. While QLoRA typically outperforms ICL with the better models (FLAN-T5, Llama-2), this relationship reverses given sufficient in-context examples (Figure \ref{fig:grid-of-graphs}). Figure \ref{fig:icl-v-lora-chq} contains similar results with patient health questions.
  • Figure 5: MEDCON scores vs. number of in-context examples across models and datasets. We also include the best model fine-tuned with QLoRA (FLAN-T5) as a horizontal dashed line for valid datasets. Zero-shot prompting (0 examples) often yields considerably inferior results, underscoring the need for adaptation methods. Note the allowable number of in-context examples varies significantly by model and dataset. See Figure \ref{fig:grid-of-graphs} for results across all four metrics.
  • ...and 10 more figures