MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters

Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich

TL;DR

MeDiSumQA creates a patient-centric QA benchmark derived from MIMIC-IV discharge letters to enable standardized evaluation of LLMs in generating safe, understandable hospital information. The authors build an automated generation pipeline, followed by physician curation, resulting in 416 QA pairs across six categories, and they assess a range of general- and biomedical-domain LLMs using both automatic metrics (ROUGE, BERTScore, UMLS-F1) and manual physician evaluations. Findings show general-domain LLMs can rival or outperform biomedical-adapted models, with automatic metrics correlating with human judgments but still benefiting from human review to capture safety and patient-friendliness. The work highlights the importance of long-document handling, data contamination considerations, and public release via PhysioNet to accelerate patient-centered AI research and safer clinical communication.

Abstract

While increasing patients' access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.

Paper Structure

This paper contains 21 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Generation process of MeDiSumQA. After identifying the discharge letter, we separate it from the main document and use an LLM to split it into sentences (1). Based on these sentences, we let an LLM generate matching questions (2). The resulting question-answer pairs were reviewed and curated by a physician, resulting in the final MeDiSumQA dataset of 416 question-answer pairs (3). For inference, we provide LLMs with the discharge summary (without the bottom discharge letter) and pose the generated question. The model answer is then compared to the extracted ground truth answer (4).
  • Figure 2: Frequency of question-answer categories in MeDiSumQA.
  • Figure 3: Example of QA pairs in MeDiSumQA dataset.
  • Figure 4: Physicians’ evaluation of model-generated answers on MeDiSumQA. Generated answers by Llama-3.1-8B-Instruct (green) and Mistral-7B-Instruct-v0.1 (red) were sorted by their average automatic evaluation scores and divided into 5 bins. From each bin, 10 examples per model were sampled and rated by a physician across Factuality, Brevity, Patient-Friendliness, Relevance, and Safety. Each subplot displays scores either between 1 and 5 [Factuality, Brevity, Patient-Friendliness, Relevance] or 0 and 1 [Safety].
  • Figure 5: Physician preferences for answers generated by Mistral-7B-Instruct-v0.1 (a) and Llama-3.1-8B-Instruct (b) and the ground truth answers.
  • ...and 3 more figures
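
The sampling procedure described in the Figure 4 caption (sort answers by average automatic score, divide into 5 bins, draw 10 examples per bin) could be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the record layout, the function name `bin_and_sample`, the metric keys, and the choice of equal-sized (rather than equal-width) bins are all assumptions, since the caption does not specify them.

```python
import random


def average_score(record):
    """Mean of a record's automatic metric scores (hypothetical record layout)."""
    values = list(record["scores"].values())
    return sum(values) / len(values)


def bin_and_sample(records, n_bins=5, per_bin=10, seed=0):
    """Sort records by average score, split into n_bins contiguous bins of
    near-equal size (an assumption), and sample per_bin records from each."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    ranked = sorted(records, key=average_score)
    size, remainder = divmod(len(ranked), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `remainder` bins take one extra record each.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ranked[start:end])
        start = end
    return [rng.sample(b, min(per_bin, len(b))) for b in bins]


# Synthetic stand-in data: 416 answers with made-up metric values.
records = [
    {"id": i, "scores": {"rouge_l": i / 416, "bertscore": (i % 7) / 7, "umls_f1": 0.5}}
    for i in range(416)
]
samples = bin_and_sample(records)
```

Running this once per model would yield 50 answers per model for physician rating, matching the counts given in the caption.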