A dataset and benchmark for hospital course summarization with adapted large language models

Asad Aali, Dave Van Veen, Yamin Ishraq Arefeen, Jason Hom, Christian Bluethgen, Eduardo Pontes Reis, Sergios Gatidis, Namuun Clifford, Joseph Daws, Arash S. Tehrani, Jangwon Kim, Akshay S. Chaudhari

TL;DR

This work introduces MIMIC-IV-BHC, a novel preprocessed dataset of clinical note and brief hospital course (BHC) pairs for adapting LLMs to BHC synthesis from clinical notes, and presents an open-source benchmark of LLM performance on this task.

Abstract

Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) show remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as synthesizing BHCs from clinical notes have not been shown. We introduce a novel pre-processed dataset, MIMIC-IV-BHC, containing clinical note and BHC pairs to adapt LLMs for BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of two general-purpose LLMs and three healthcare-adapted LLMs. Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs (Clinical-T5-Large, Llama2-13B, FLAN-UL2) and two proprietary LLMs (GPT-3.5, GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with five clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We observe that the fine-tuned Llama2-13B outperforms other domain-adapted models on the quantitative evaluation metrics BLEU and BERT-Score. GPT-4 with in-context learning is more robust to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study shows a significant preference for summaries generated by GPT-4 with in-context learning over both the fine-tuned Llama2-13B summaries and the original clinician-written summaries, highlighting the need for qualitative clinical evaluation.

Paper Structure (23 sections, 5 figures, 1 table)

This paper contains 23 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: A full-length clinical note with its respective clinician-written and LLM-generated BHC (with feedback). This note was sampled from the MIMIC-IV-BHC 0 to 1,024 context range subset. "Summary 1" is the actual BHC written by a clinician, and "Summary 2" is the BHC generated by GPT-4 adapted through in-context learning (ICL). The summaries were presented for feedback to the reader randomly, without specifying “clinician” or “GPT-4”.
  • Figure 2: Overall schematic of our study. We evaluate a variety of models, including open-source models containing up to 20 billion parameters, and larger-scale proprietary models. Each model is adapted to the summarization task using the adaptation strategies displayed (except QLoRA is not applied to GPT-3.5 and GPT-4). We evaluate each model's performance by comparing its outputs with expert clinician summaries. Each model paired with the adaptation strategy is evaluated using quantitative similarity metrics. Finally, we perform a clinical study where five clinicians rate three summaries (randomized order) for every summarization task: best-performing open-source model, best-performing proprietary model, and clinician-written. The $^{*}$ indicates GPT-4's maximum context length at the time of experimentation (later increased to 128,000).
  • Figure 3: Quantitative metric results for each model across domain-adaptation strategies. Overall, QLoRA outperforms the other adaptation methods. Specifically, QLoRA Llama2-13B outperforms other models in BLEU score, while achieving performance comparable to Clinical-T5-Large in BERT-Score and ROUGE-L.
  • Figure 4: a) Quantitative evaluation metrics across increasing input context lengths. GPT-4 performs consistently, whereas Llama2-13B shows a drop in summarization performance as input context length increases. b) Context size analysis for QLoRA Llama2-13B (in/out-of-distribution), where each item on the y-axis displays an independent model fine-tuned on samples from a specific context length range. The combined model trained on 0 to 4,096 context length inputs outperforms the other models when the input clinical notes at inference are longer (more than 1,024 tokens).
  • Figure 5: a) Violin plot showing results from the reader study with five clinicians. Clinicians exhibit a strong preference for in-context GPT-4 (adapted large-scale proprietary LLM) summaries over QLoRA Llama2-13B (adapted open-source LLM) and clinician-written summaries, with statistical significance by the Wilcoxon signed-rank test across each attribute (*: $p < 0.001$; NS: not significant). b) Plot showing common themes derived from detailed reader comments. This sub-analysis reiterates the preference for in-context GPT-4, while the open-source model and clinicians perform comparably.
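
ROUGE-L, one of the similarity metrics reported in Figure 3, scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference. The following is a minimal sketch of that idea, not the paper's evaluation code: lowercasing and whitespace tokenization are assumptions, the function names `lcs_length` and `rouge_l_f1` are hypothetical, and the example sentences are invented for illustration rather than taken from MIMIC-IV-BHC.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented sentences, for illustration only.
    reference = "patient admitted with pneumonia and treated with antibiotics"
    candidate = "patient was admitted with pneumonia treated with iv antibiotics"
    print(round(rouge_l_f1(reference, candidate), 4))
```

Because the score rewards only token overlap in order, two summaries with similar ROUGE-L can still differ in clinical quality, which is consistent with the abstract's point that the reader study revealed preferences the quantitative metrics did not.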