Automatic Evaluation of Healthcare LLMs Beyond Question-Answering

Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla

TL;DR

This paper examines how to evaluate healthcare LLMs beyond traditional question answering by comparing open-ended and close-ended benchmarks, and by introducing CareQA as a dual-format benchmark. It shows that open-ended and MCQA evaluations often diverge, highlighting the need for broad, task-specific assessment. A three-cluster taxonomy of open-ended metrics is revealed, along with observations on resilience to rephrasing and self-consistency. To address gaps in factuality assessment for open-ended outputs, the authors propose Relaxed Perplexity and validate it on medical datasets, suggesting practical paths for more robust healthcare LLM evaluation with real-world impact.

Abstract

Current Large Language Model (LLM) benchmarks are often based on open-ended or close-ended QA evaluations, which avoid the need for human labor. Close-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended measurements capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work focuses on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open-ended and close-ended benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark, CareQA, with both open-ended and close-ended variants. Finally, we propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.

Paper Structure

This paper contains 27 sections, 13 equations, 18 figures, 20 tables.

Figures (18)

  • Figure 1: Correlation between the weighted average accuracy from the MCQA benchmarks and all other close-ended and open-ended tasks and metrics. These results correspond to the smaller models.
  • Figure 2: Mean variance distributions across different runs and averaged across models using the CareQA-Open dataset. Closer to 0 means more self-consistent.
  • Figure 3: CareQA example from Medicine category.
  • Figure 4: Iterations with human evaluators to create the CareQA dataset in English, including both open and closed versions.
  • Figure 5: Category distribution per Category and Year (CareQA close-ended)
  • ...and 13 more figures