Extrinsically-Focused Evaluation of Omissions in Medical Summarization

Elliot Schumacher, Daniel Rosenthal, Dhruv Naik, Varun Nair, Luladay Price, Geoffrey Tso, Anitha Kannan

TL;DR

This work proposes MED-OMIT, a metric for evaluating omissions in medical summaries. As a case study, it uses provider-patient history conversations to generate a subjective (a summary of the patient's history).

Abstract

Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify their performance has lagged. One example of this challenge is evaluating a summary of a patient's medical record. Such a summary lets the provider quickly get a high-level overview of the patient's health status. Yet a summary that omits important facts about the patient's record can produce a misleading picture, which can have negative consequences for medical decision-making. We propose MED-OMIT as a metric to explore this challenge. We focus on using provider-patient history conversations to generate a subjective (a summary of the patient's history) as a case study. We begin by discretizing facts from the dialogue and identifying which are omitted from the subjective. To determine which facts are clinically relevant, we measure the importance of each fact to a simulated differential diagnosis. We compare MED-OMIT's performance to that of clinical experts and find broad agreement. We use MED-OMIT to evaluate LLM performance on subjective generation and find that some LLMs (gpt-4 and llama-3.1-405b) work well with little effort, while others (e.g. Llama 2) perform worse.

Paper Structure (21 sections, 1 equation, 13 figures, 5 tables)


Figures (13)

  • Figure 1: Example GPT-4 generated subjective paired with the list of omitted facts and their weight. The facts are generated from the original patient-provider dialogue and their importance is scored using the MED-OMIT pipeline. See Appendix Figures \ref{fig:example_short_chat}, \ref{fig:example_pt1} and \ref{fig:example_pt2} for additional context.
  • Figure 2: Given a patient-provider dialogue (left), we compute a summary and use a fact extraction module to extract facts from the conversation. We use the extracted facts from the conversation to identify if any facts are omitted from the summary. We also compute a differential diagnosis using the conversation data.
  • Figure 3: Given the previous outputs of the diagnosis prediction and fact extraction modules, we cluster facts that either support or refute a diagnosis. We also categorize each fact w.r.t. each diagnosis. With the clustered & categorized facts and the previously computed fact omissions, we assign an importance and uniqueness score to each fact.
  • Figure 4: For each summary LLM, we calculate the mean of the number of MED-OMIT omissions (left) and the cumulative weight (right), with color indicating model family. A lower score indicates higher performance. See Appendix Table \ref{tab:weights_and_counts} for full results.
  • Figure 5: Confusion matrix of annotator agreement with GPT-4 on the Fact Omission task. Each cell shows the count for one agreement group; e.g., there are 35 examples where GPT-4 selected No and annotators selected Partially. The overall agreement was 80%. Note that while we give annotators three labels to choose from, MED-OMIT uses only a binary judgment (it excludes the "Partially" option). Therefore, we count annotators selecting "Partially" as correct if MED-OMIT selects "Yes". We believe work capturing the degree of omission would provide further insight.
  • ...and 8 more figures
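
The aggregate statistics named in the captions of Figures 4 and 5 can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the input layout (one list of omitted-fact weights per summary, and pairs of model and annotator labels), and the sample values are all assumptions made for illustration. Only the two rules are taken from the captions: per-model means of omission count and cumulative weight, and counting an annotator's "Partially" as agreement when MED-OMIT selects "Yes".

```python
from statistics import mean


def omission_stats(weights_per_summary):
    """Return (mean omission count, mean cumulative weight) for one summary LLM.

    weights_per_summary is a hypothetical layout: one inner list per summary,
    holding the weight of each fact that was omitted from that summary.
    """
    counts = [len(weights) for weights in weights_per_summary]
    totals = [sum(weights) for weights in weights_per_summary]
    return mean(counts), mean(totals)


def labels_agree(model_label, annotator_label):
    """Agreement rule from the Figure 5 caption.

    The model label is binary ("Yes" or "No"); annotators may also choose
    "Partially", which counts as agreement only when the model says "Yes".
    """
    if annotator_label == "Partially":
        return model_label == "Yes"
    return model_label == annotator_label


def agreement_rate(label_pairs):
    """Fraction of (model_label, annotator_label) pairs that agree."""
    return sum(labels_agree(m, a) for m, a in label_pairs) / len(label_pairs)


# Made-up illustrative inputs, not data from the paper.
example_weights = [[0.5, 0.25], [0.25]]
example_pairs = [("Yes", "Yes"), ("Yes", "Partially"),
                 ("No", "Partially"), ("No", "No")]

mean_count, mean_weight = omission_stats(example_weights)
rate = agreement_rate(example_pairs)
```

Under this reading, a lower mean count or mean cumulative weight indicates a better summarizer, matching the "lower is better" note in the Figure 4 caption.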