Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

Sarvesh Soni, Dina Demner-Fushman

TL;DR

Automated evaluation can distinguish good from bad AI responses to patient questions about hospitalization, using a large-scale grounded QA setup over EHR notes and clinician-authored reference answers. The study covers 100 cases and 28 systems, with 8400 human judgments across three dimensions (answers-question, uses-evidence, uses-knowledge). It shows that anchoring metrics to clinician-authored reference answers yields strong alignment with human judgments on the answers-question and uses-knowledge dimensions. Metric performance varies by dimension: semantic, reference-based metrics outperform lexical ones on several tasks, while ranking-based approaches (Pyramid, MACE) closely track human judgments, enabling scalable system benchmarking. The findings support rapid, low-cost benchmarking and error analysis to improve AI-assisted patient-clinician communication, and they highlight the need for evaluation pipelines that separate evidence selection from narrative generation.

Abstract

Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses--human expert review--is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.

Paper Structure

This paper contains 10 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Kendall's $\tau$ correlations between system rankings induced by automated metrics (rows) and human‑judgment rankings under two rank‑aggregation schemes (Pyramid and MACE) across the three evaluation dimensions. Cells encode $\tau$ (-1 to +1; blue $\rightarrow$ red). Rows are grouped and color-coded by metric type (legend) and are ordered by $\tau$ for answers‑question under the Pyramid scheme. Metrics tagged (human) use clinician‑authored answers as ground truth, whereas metrics without this tag use clinical note sentences annotated as essential for comparison.
  • Figure 2: Agreement between Pyramid and MACE rankings. Each panel (rows: answers‑question, uses‑evidence, uses‑knowledge; columns: MACE, MACE with 2 annotations, MACE with 1 annotation) plots MACE rank (y; 1=best) against Pyramid rank (x; 1=best) for n=28 systems; the dashed line marks perfect agreement. Kendall's $\tau$ is annotated in each panel.
  • Figure 3: Case‑level performance across systems using the Pyramid scheme. Top row: histograms of across‑system mean scores for each case on the three dimensions (answers‑question, uses‑evidence, uses‑knowledge); vertical dashed lines mark the first (Q1) and third (Q3) quartiles. Bottom row: per‑case mean score (higher = easier) versus across‑system standard deviation, with points colored by difficulty level (scatter plot legend). Cases are labeled as difficult if their mean falls at or below Q1, easy if it falls at or above Q3, and moderate otherwise.
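The two computations described in the captions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the tau-b variant of Kendall's $\tau$ (the captions do not say which variant was used) and the "inclusive" quartile method from the standard library (the captions do not say how Q1 and Q3 were computed). The function names `kendall_tau_b` and `label_case_difficulty` and the toy inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import quantiles


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists.

    Assumption: tau-b (tie-corrected); the source does not name the variant.
    Uses a simple O(n^2) pair count, fine for n=28 systems.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


def label_case_difficulty(case_means):
    """Label each case by its across-system mean score.

    Rule from the Figure 3 caption: difficult if mean <= Q1, easy if
    mean >= Q3, moderate otherwise. The quartile method is an assumption.
    """
    q1, _, q3 = quantiles(case_means.values(), n=4, method="inclusive")
    labels = {}
    for case_id, mean in case_means.items():
        if mean <= q1:
            labels[case_id] = "difficult"
        elif mean >= q3:
            labels[case_id] = "easy"
        else:
            labels[case_id] = "moderate"
    return labels


# Toy usage with made-up numbers (not data from the paper).
metric_scores = [0.9, 0.7, 0.4, 0.2]   # hypothetical metric score per system
human_scores = [0.8, 0.6, 0.5, 0.1]    # hypothetical human score per system
tau = kendall_tau_b(metric_scores, human_scores)

toy_means = {"c1": 0.1, "c2": 0.2, "c3": 0.3, "c4": 0.4, "c5": 0.5}
difficulty = label_case_difficulty(toy_means)
```

For real analyses, a library routine such as `scipy.stats.kendalltau` would typically replace the hand-written pair count; the explicit loop is kept here only to make the definition visible.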