LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

Federica Bologna, Tiffany Pan, Matthew Wilkens, Yue Guo, Lucy Lu Wang

TL;DR

LongQAEval addresses the challenge of evaluating long-form clinical QA under resource constraints by comparing two annotation designs, coarse (answer-level) and fine-grained (sentence-level), across correctness, relevance, and safety. It analyzes 300 real patient questions with physician-written and LLM-generated answers, plus LLM-as-judge assessments, and finds that fine-grained evaluation improves inter-annotator agreement (IAA) on correctness, while partial fine-grained annotation retains reliability at lower cost. The study reports that GPT-4 and Llama-3.1-Instruct-405B approach physician-level performance on correctness and relevance, though safety remains a persistent bottleneck, and that annotation design should be tailored to the dimension of interest. The findings yield practical recommendations for reliable, resource-conscious evaluation of clinical QA systems and suggest that LLMs can complement expert judgments under budget and expertise constraints.

Abstract

Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
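
To make the idea of partial fine-grained annotation concrete, the sketch below samples a few sentence-level labels per answer, averages them, and checks how well those partial scores track the scores obtained from all sentences. Everything here is an assumption for illustration: the toy binary labels, the helper names `partial_score` and `pearson`, the choice of 3 sentences per answer, and aggregation by simple mean are not taken from the paper, whose actual protocol may differ.

```python
import random
from statistics import mean

def partial_score(sentence_labels, k, rng):
    """Mean of k sentence labels sampled without replacement (hypothetical aggregation)."""
    k = min(k, len(sentence_labels))
    return mean(rng.sample(sentence_labels, k))

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data (made up): one list of binary sentence-level labels per answer.
answers = [
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1, 1, 0, 1],
]

# Score each answer from all of its sentences.
full = [mean(a) for a in answers]

# Repeat the partial annotation over 100 random subsets of 3 sentences each.
rng = random.Random(0)
correlations = []
for _ in range(100):
    partial = [partial_score(a, 3, rng) for a in answers]
    correlations.append(pearson(partial, full))

print(f"mean correlation with full annotation: {mean(correlations):.3f}")
```

Repeating the sampling many times gives a distribution of correlations, from which a spread or confidence interval can be read off; the number of sampled sentences is the knob that trades annotation cost against reliability.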

Paper Structure

This paper contains 26 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Average inter-annotator agreement (IAA) (Randolph’s $\kappa$) for expert annotators across annotation groups for correctness, relevance, and safety.
  • Figure 2: Correlation between partial fine-grained annotations and full fine-grained annotations and variance when partially annotating an answer (left). Inter-annotator variance when partially annotating an answer compared against coarse annotations (right). Confidence intervals shown are 95% and computed across 100 random subsets.
  • Figure 3: Comparison of system-level average coarse versus fine-grained ratings for correctness, relevance, and safety. Scores range from 0 to 1, with 1 indicating optimal performance.
  • Figure 4: Average inter-annotator agreement (Randolph's $\kappa$) between average expert ratings and LLM-as-judge ratings for correctness, relevance, and safety.
  • Figure 5: Average intra-rater reliability (IRR) (percent agreement) across annotators for correctness, relevance, and safety.
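
Several of the figures above report agreement as Randolph's $\kappa$. For readers unfamiliar with it, the following is a minimal sketch of the standard free-marginal multirater definition, $\kappa = (P_o - 1/q) / (1 - 1/q)$, where $P_o$ is the average proportion of agreeing rater pairs per item and $q$ is the number of categories. The function name `randolph_kappa` and the toy ratings are hypothetical, and the paper's exact label scheme and computation may differ.

```python
from collections import Counter

def randolph_kappa(ratings, num_categories):
    """Free-marginal multirater kappa.

    ratings: list of items, each a list of category labels (one per rater).
    num_categories: number of categories q available to the raters.
    """
    per_item_agreement = []
    for item in ratings:
        n = len(item)
        if n < 2:
            raise ValueError("each item needs at least two ratings")
        counts = Counter(item)
        # Ordered pairs of raters that chose the same category.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n * (n - 1)))

    p_observed = sum(per_item_agreement) / len(per_item_agreement)
    # Chance agreement is fixed at 1/q: raters are not assumed to match
    # any particular marginal distribution over categories.
    p_chance = 1 / num_categories
    return (p_observed - p_chance) / (1 - p_chance)

# Toy example (made up): three raters, binary labels, two items.
toy = [[1, 1, 0], [0, 0, 0]]
print(round(randolph_kappa(toy, num_categories=2), 3))  # 0.333
```

Because chance agreement depends only on the number of categories, this statistic is less affected by skewed label distributions than fixed-marginal alternatives such as Fleiss' $\kappa$.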