From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan

TL;DR

The paper tackles the challenge of evaluating AI-generated clinical notes by transforming real user feedback into grounded, binary checklists that can be enforced by LLM evaluators. It introduces an end-to-end pipeline that generates, refines, and optimizes checklists using data from over 21,000 de-identified encounters, expert ratings, and reference notes. The resulting feedback-driven checklist outperforms a baseline in coverage, diversity, enforceability, predictive power, and robustness to perturbations, and it aligns better with clinician preferences. This approach offers a scalable, interpretable evaluation tool for deployed AI scribes, with future work aimed at expanding to additional note sections and incorporating human studies.

Abstract

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.

Paper Structure

This paper contains 53 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Example checklist questions for the Assessment and Plan section of a clinical note. The checklist score consists of the proportion of satisfied questions.
  • Figure 2: Proposed end-to-end pipeline.
  • Figure 3: The feedback checklist shows a larger perturbation $\Delta$ than the baseline checklist, indicating greater robustness to quality-degrading perturbations, particularly for missing information, organization, and redundancy/hallucination.
  • Figure 4: Correlation with human preference ratings is significant for our checklist ($p \le 0.05$ from a paired $t$-test, Cohen's $d=0.28$), but not for the baseline.
  • Figure 5: $\alpha$ values for the objective function, where $\alpha$ is the weight of the coverage term. The $\alpha$ value is set to $0.5$ for the final checklists, since it provides balance between coverage and diversity.
  • ...and 3 more figures
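
The Figure 1 caption defines the checklist score as the proportion of satisfied questions. A minimal sketch of that calculation follows, assuming each question has already been answered yes or no (for example by an LLM evaluator). The function name `checklist_score` and the sample answers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def checklist_score(answers: Sequence[bool]) -> float:
    """Return the fraction of checklist questions that were satisfied.

    `answers` holds one boolean per checklist question (True = satisfied).
    This helper is illustrative only; the paper's implementation may differ.
    """
    if not answers:
        raise ValueError("checklist must contain at least one question")
    # True counts as 1 and False as 0, so the sum is the number satisfied.
    return sum(answers) / len(answers)


# Hypothetical answers for a four-question checklist applied to one note.
example_answers = [True, True, False, True]
example_score = checklist_score(example_answers)
print(f"checklist score: {example_score:.2f}")
```

Because every question is binary, the score stays interpretable: a note's score can be traced back to exactly which questions it failed.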