DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries

Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto

TL;DR

DistillNote introduces a scalable, task-based framework to evaluate LLM-generated clinical note summaries by directly measuring how well compressed notes retain downstream diagnostic signal. Using MIMIC-IV admission notes and heart failure (HF) prediction as the testbed, the authors compare one-step, divide-and-conquer, and distilled summarization strategies across three LLMs, evaluated with LLM-as-judge, clinician validation, and HF prediction performance. The study finds that substantial compression preserves most predictive utility: AUROC and AUPRC remain high, and the Distilled strategy achieves the greatest compression with modest losses. These results highlight compression-to-performance tradeoffs and the value of functional evaluation for deployment decisions. The framework is adaptable to other clinical tasks and domains, offering a scalable approach to assessing the real-world utility and safety of AI-generated clinical summaries. Overall, DistillNote provides a principled, outcome-focused method to guide the integration of AI summaries into clinical workflows while balancing efficiency and diagnostic fidelity.

Abstract

Large language models (LLMs) are increasingly used to generate summaries from clinical notes. However, their ability to preserve essential diagnostic information remains underexplored, which could lead to serious risks for patient care. This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much prediction signal is retained. We generated over 192,000 LLM summaries from MIMIC-IV clinical notes with increasing compression rates: standard, section-wise, and distilled section-wise. Heart failure diagnosis was chosen as the prediction task, as it requires integrating a wide range of clinical signals. LLMs were fine-tuned on both the original notes and their summaries, and their diagnostic performance was compared using the AUROC metric. We contrasted DistillNote's results with evaluations from LLM-as-judge and clinicians, assessing consistency across different evaluation methods. Summaries generated by LLMs maintained a strong level of heart failure diagnostic signal despite substantial compression. Models trained on the most condensed summaries (about 20 times smaller) achieved an AUROC of 0.92, compared to 0.94 with the original note baseline (97 percent retention). Functional evaluation provided a new lens for medical summary assessment, emphasizing clinical utility as a key dimension of quality. DistillNote introduces a new scalable, task-based method for assessing the functional utility of LLM-generated clinical summaries. Our results detail compression-to-performance tradeoffs from LLM clinical summarization for the first time. The framework is designed to be adaptable to other prediction tasks and clinical domains, aiding data-driven decisions about deploying LLM summarizers in real-world healthcare settings.
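The core measurement described in the abstract, comparing AUROC on original notes against AUROC on summaries, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `auroc` and `retention`, the toy labels and scores, and the definition of retention as a simple ratio of the two AUROC values. It is not the authors' pipeline, which fine-tunes LLMs on MIMIC-IV notes.

```python
def auroc(labels, scores):
    """Rank-based AUROC: the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def retention(auroc_summary, auroc_full):
    """Fraction of the full-note AUROC kept by the summary-based model
    (assumed definition: a plain ratio)."""
    return auroc_summary / auroc_full


# Hypothetical toy data: 1 = heart failure, 0 = no heart failure.
labels = [1, 1, 1, 0, 0, 0]
full_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]       # model trained on full notes
summary_scores = [0.9, 0.8, 0.25, 0.3, 0.2, 0.1]   # model trained on summaries

auroc_full = auroc(labels, full_scores)
auroc_summary = auroc(labels, summary_scores)
print(f"full={auroc_full:.3f} summary={auroc_summary:.3f} "
      f"retention={retention(auroc_summary, auroc_full):.3f}")
```

Under this ratio definition, the rounded AUROC values quoted in the abstract (0.92 and 0.94) give a retention of about 0.979.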

Paper Structure

This paper contains 45 sections, 5 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Overview of DistillNote. We evaluate clinical LLM summary quality at scale based on functional utility, i.e., using summaries to diagnose heart failure, and compare the results with LLM-as-judge and clinician evaluations. Most predictive performance is retained under substantial text compression, indicating that LLM summaries preserve critical diagnostic signals.
  • Figure 2: LLM-as-judge scores across summarization approaches. All fall in the "adequate" to "very good" range. One-step scores highest on relevance and actionability, while Distilled shows higher factuality. Standard deviations indicate scoring uncertainty.
  • Figure 3: Minimal loss in AUROC despite compression. Summarization strategies yield AUROC scores within 1.2–4.0% of the full-note baseline, even at 79% text reduction. F = Full note (0.939 AUROC, 412 average words), O = One-step (0.927, 262 average words), S = Structured (0.921, 195 average words), D = Distilled (0.900, 87 average words).
  • Figure 4: LLM-as-judge raw scores per summarization approach and metric.
  • Figure 5: Heatmap of scores per summarization approach and metric.
  • ...and 1 more figure
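The compression-to-performance tradeoff in the Figure 3 caption can be reproduced with simple arithmetic. The sketch below uses only the AUROC values and average word counts quoted in that caption. The names `FIGURE_3` and `tradeoff` and the three derived quantities are illustrative choices, not definitions taken from the paper.

```python
# (AUROC, average words) per strategy, copied from the Figure 3 caption.
FIGURE_3 = {
    "Full note": (0.939, 412),
    "One-step": (0.927, 262),
    "Structured": (0.921, 195),
    "Distilled": (0.900, 87),
}


def tradeoff(results, baseline="Full note"):
    """Return text reduction and AUROC change for each strategy,
    measured against the baseline entry."""
    base_auroc, base_words = results[baseline]
    rows = {}
    for name, (auroc, words) in results.items():
        rows[name] = {
            "text_reduction": 1 - words / base_words,  # fraction of words removed
            "auroc_drop": base_auroc - auroc,          # absolute AUROC difference
            "auroc_retention": auroc / base_auroc,     # fraction of baseline AUROC kept
        }
    return rows


for name, row in tradeoff(FIGURE_3).items():
    print(f"{name:<11} reduction={row['text_reduction']:.1%} "
          f"drop={row['auroc_drop']:.3f} retention={row['auroc_retention']:.1%}")
```

For the Distilled strategy, the text reduction comes out at roughly 79%, in line with the caption, with an absolute AUROC drop of 0.039.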