WisPerMed at "Discharge Me!": Advancing Text Generation in Healthcare with Large Language Models, Dynamic Expert Selection, and Priming Techniques on MIMIC-IV

Hendrik Damm; Tabea M. G. Pakull; Bahadır Eryılmaz; Helmut Becker; Ahmad Idrissi-Yaghir; Henning Schäfer; Sergej Schultenkämper; Christoph M. Friedrich

TL;DR

The paper tackles the administrative burden of electronic health record documentation by automating the generation of the "Brief Hospital Course" (BHC) and "Discharge Instructions" (DI) sections of Discharge Summaries (DS) from MIMIC-IV. It combines few-shot learning, instruction tuning, MIMIC Section Identification (MIMIC-SID), and Dynamic Expert Selection (DES), with priming on the Asclepius clinical notes dataset. The strongest configuration, DES 5, achieved the top overall score of 0.332, which illustrates the value of generating multiple outputs and selecting the best via data-driven criteria; priming and longer-context models also substantially improved performance. These findings suggest that state-of-the-art LLM methods, augmented with expert selection and domain-specific priming, can meaningfully reduce clinician workload while maintaining documentation quality, and they point to practical pathways for integrating automated DS generation into clinical workflows.

Abstract

This study aims to leverage state-of-the-art language models to automate the generation of the "Brief Hospital Course" and "Discharge Instructions" sections of Discharge Summaries from the MIMIC-IV dataset, reducing clinicians' administrative workload. We investigate how automation can improve documentation accuracy, alleviate clinician burnout, and enhance operational efficacy in healthcare facilities. This research was conducted as part of our participation in the Shared Task Discharge Me! at BioNLP @ ACL 2024. Several strategies, including few-shot learning, instruction tuning, and Dynamic Expert Selection (DES), were employed to develop models capable of generating the required text sections. Notably, utilizing an additional clinical domain-specific dataset demonstrated substantial potential to enhance clinical language processing. The DES method, which optimizes the selection of text outputs from multiple predictions, proved especially effective: it achieved the highest overall score in the competition, 0.332, surpassing single-model outputs. This finding suggests that advanced deep learning methods in combination with DES can effectively automate parts of electronic health record documentation. These advancements could enhance patient care by freeing clinician time for patient interactions. The integration of text selection strategies represents a promising avenue for further research.
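
The selection idea behind DES (several candidate outputs per document, one chosen by a data-driven criterion) can be illustrated with a minimal sketch. The abstract does not specify the selection criteria, so the scoring function below (`token_overlap`) is a hypothetical stand-in, and the names `select_best_output`, `model_a`, and `model_b` are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, Dict, Tuple


def token_overlap(candidate: str, source: str) -> float:
    """Hypothetical proxy score (an assumption, not the paper's metric):
    the fraction of candidate tokens that also occur in the source text."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    source_tokens = set(source.lower().split())
    return sum(tok in source_tokens for tok in cand_tokens) / len(cand_tokens)


def select_best_output(
    candidates: Dict[str, str],
    source: str,
    score_fn: Callable[[str, str], float] = token_overlap,
) -> Tuple[str, str]:
    """Return (model_name, text) of the candidate with the highest score.

    `candidates` maps a model name to the text that model generated
    for the same discharge summary `source`.
    """
    best_model = max(candidates, key=lambda name: score_fn(candidates[name], source))
    return best_model, candidates[best_model]


# Usage with made-up texts (no real patient data).
source_text = "the patient had a fever and cough"
outputs = {
    "model_a": "patient had fever",
    "model_b": "unrelated words here",
}
chosen_model, chosen_text = select_best_output(outputs, source_text)
```

Any reference-free score that can be computed at test time could be passed as `score_fn`; the sketch only fixes the structure of "generate several, keep the best-scoring one".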

Paper Structure (27 sections, 8 figures, 7 tables)


Introduction
Dataset
Evaluation
Relevance
Factuality
Readability
Methods
Few-Shot learning
Instruction Tuning
MIMIC Section Identification
Hyperparameters
Dynamic Expert Selection
DES 1
DES 2
DES 3
...and 12 more sections

Figures (8)

Figure 1: This workflow, exemplified by DI, is applied to BHC in the same way. With MIMIC-SID the dataset is divided into up to 50 sections. For each training section, the average BERTScore is computed using the target text as a reference. The sections are then ranked from highest to lowest BERTScore, and this ranking is applied to both the training and testing DS. The ranked training dataset is used to train the Llama-3-8B-I model. Subsequently, the ranked testing dataset is presented to the model in the form of prompts to generate DI outputs.
Figure 2: Heatmap of the Pearson correlations between pre-calculated scores and the overall score on the validation dataset. The pre-calculated scores include factuality scores (SummaC, AlignScore, MEDCON and METEOR), which are calculated for the generated targets of the Mistralv2 + Asclepius model with the whole DS as the reference, and readability scores (FKGL, DCRS and CLI).
Figure 3: Example of repetitive and hallucinated DI output generated by Llama-3-8B-I. The words hyperglycemia and hypocalcemia are very similar but only one of them should be in the generated targets. The other one was not mentioned in the DS.
Figure 4: Discharge Instruction Prompt for Few-Shot learning with WizardLM-2.
Figure 5: Brief Hospital Course Prompt for Few-Shot learning with WizardLM-2.
...and 3 more figures
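
The section-ranking step described in the Figure 1 caption (average a per-section score over the training set, rank sections from highest to lowest, then apply the same ranking to training and testing documents) can be sketched as follows. The sketch assumes the per-section BERTScore values have already been computed and are passed in as plain floats; the function names and the example section names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def rank_sections(train_scores: List[Dict[str, float]]) -> List[str]:
    """Rank section names by their average score over the training documents.

    Each element of `train_scores` maps section name -> precomputed score
    (e.g. BERTScore of that section against the target text) for one document.
    """
    per_section = defaultdict(list)
    for doc_scores in train_scores:
        for name, score in doc_scores.items():
            per_section[name].append(score)
    # Highest average score first.
    return sorted(per_section, key=lambda name: mean(per_section[name]), reverse=True)


def reorder_document(sections: Dict[str, str], ranking: List[str]) -> List[Tuple[str, str]]:
    """Order one document's sections by the global ranking.

    Sections absent from the document are skipped; sections not in the
    ranking are dropped in this sketch (an assumption).
    """
    return [(name, sections[name]) for name in ranking if name in sections]


# Usage with made-up scores and section names.
scores = [
    {"medications": 0.9, "history": 0.6},
    {"medications": 0.8, "history": 0.7, "labs": 0.5},
]
ranking = rank_sections(scores)
ordered = reorder_document({"labs": "x", "medications": "y"}, ranking)
```

Applying one ranking learned on the training set to both splits keeps the prompt layout identical at training and inference time, so the most target-relevant sections appear first in both.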
