Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

Sarvesh Soni; Dina Demner-Fushman

TL;DR

Automated evaluation can distinguish good from bad AI responses to patient questions about hospitalization, using a large-scale grounded QA setup built on EHR notes and clinician-authored references. The study covers 100 patient cases and 28 systems, with 8400 human judgments across three dimensions. Anchoring evaluation to clinician-authored reference answers yields strong alignment with human judgments on the answers-question and uses-knowledge dimensions. Broken down by dimension, semantic, reference-based metrics outperform lexical ones for several tasks, while ranking-based approaches (Pyramid, MACE) closely track human judgments, enabling scalable system benchmarking. The findings support rapid, low-cost benchmarking and error analysis to improve AI-assisted patient-clinician communication, and they highlight the need to design evaluation pipelines that separate evidence selection from narrative generation.

Abstract

Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, which limits scalability. Automated metrics are promising, yet their alignment with human judgments varies and often depends on context. To assess the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. When clinician-authored reference answers were used to anchor the metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
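
One way to make "automated rankings closely matched expert ratings" concrete is to compare the system ordering induced by an automated metric with the ordering induced by human ratings. The sketch below is illustrative only: the abstract does not state which agreement statistic the study used, so Kendall's tau-a is an assumption here, and the system names, scores, and helper functions (`kendall_tau`, `system_means`, `rank_agreement`) are hypothetical toy values and names, not data or code from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def system_means(scores_by_system):
    """Average per-response scores into one score per system."""
    return {name: sum(vals) / len(vals) for name, vals in scores_by_system.items()}


def rank_agreement(human_scores, metric_scores):
    """System-level agreement between human ratings and an automated metric."""
    names = sorted(human_scores)
    human = system_means(human_scores)
    metric = system_means(metric_scores)
    return kendall_tau([human[n] for n in names], [metric[n] for n in names])


# Hypothetical toy data: per-case scores for four made-up systems.
human_scores = {
    "sys_a": [1, 1, 0],
    "sys_b": [1, 0, 0],
    "sys_c": [0, 0, 0],
    "sys_d": [1, 1, 1],
}
metric_scores = {
    "sys_a": [0.8, 0.7, 0.4],
    "sys_b": [0.6, 0.3, 0.2],
    "sys_c": [0.1, 0.2, 0.3],
    "sys_d": [0.9, 0.9, 0.7],
}

if __name__ == "__main__":
    print(rank_agreement(human_scores, metric_scores))
```

A value near 1 means the metric orders systems the same way the human raters do, and a value near -1 means it reverses the order. In a setup like the one described, this would be computed separately for each of the three dimensions.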
