A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models

Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, Xiaoyi Jiang

TL;DR

This work releases a rigorous labeling protocol for errors in medical texts and a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries, and shows that fine-tuning on hallucination-free data effectively reduces hallucinations.

Abstract

Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
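
To put the reported reductions on a common scale, the per-summary hallucination counts from the abstract can be converted to relative reductions. The following is a minimal sketch; the helper name `relative_reduction` and the dictionary labels are illustrative assumptions and do not come from the paper. Only the four numbers are taken from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fraction by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Mean hallucinations per summary, as reported in the abstract.
# The keys are illustrative labels, not names used by the authors.
reported = {
    "Llama 2, fine-tuned": (2.60, 1.55),
    "GPT-4, few-shot": (0.70, 0.40),
}

for setting, (before, after) in reported.items():
    reduction = relative_reduction(before, after)
    print(f"{setting}: {before:.2f} -> {after:.2f} ({reduction:.1%} fewer)")
```

Under this reading, both settings show a relative reduction of roughly 40%, although GPT-4 starts from a much lower absolute count.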

Paper Structure (33 sections, 12 figures, 17 tables)

Figures (12)

  • Figure 1: We developed a protocol for annotating hallucinations in medical text. Following this protocol, two medical experts labeled hallucinations in 100 doctor-written (Hallucinations-MIMIC-DI) and 100 LLM-generated patient summaries (Hallucinations-Generated-DI). We used the labeled hallucinations in the doctor-written summaries to derive two additional datasets by replacing or removing hallucinations (Cleaned) and by further improving the language (Cleaned & Improved). We used these two datasets for our data-centric hallucination reduction and qualitative experiments.
  • Figure 2: A synthetic MIMIC example labeled with the developed annotation protocol for hallucinations. The protocol was adapted from Thomson and Reiter (2020) and we used eleven different labels.
  • Figure 3: Patient summaries generated by Llama 70B fine-tuned on 100 Original and 100 Cleaned examples given the synthetic context in Figure 2 with annotated hallucinations according to our protocol. These are two of the five models included in the data-centric hallucination reduction experiments.
  • Figure 4: Qualitative evaluation of Llama 70B fine-tuned on all 100 examples of Cleaned & Improved and GPT-4 5-shot prompted with 5 random examples of Cleaned & Improved. We compared them to the original MIMIC summaries, LED-large fine-tuned on MIMIC-IV-Note-Ext-DI-BHC-Anno, and GPT-4 0-shot. Two medical experts evaluated 20 summaries from each of the five models.
  • Figure 5: The preprocessing steps performed on MIMIC-IV-Note to obtain the datasets MIMIC-IV-Note-Ext-DI(-BHC). The goal was to obtain diverse, free-text discharge instructions (DI) as patient summaries.
  • ...and 7 more figures