Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells

TL;DR

This work tackles the challenge of evaluating open-ended health question-answering by introducing Normalized Rank Position (NRP), a ranking-based automatic metric that compares LLM-generated answers against expert-annotated documents. The method relies on ranking signals from annotated corpora and can operate offline, enabling scalable comparisons across prompting strategies and model sizes. Applied to the CLEF 2021 eHealth dataset, the approach shows that larger models and more sophisticated prompts generally yield higher NRP and that NRP correlates with human expert preferences (mean Kendall's $\tau = 0.64$, 95% CI $[0.50, 0.78]$). These findings suggest a practical, domain-aware avenue for evaluating open-ended health QA without exhaustively annotating every answer, though limitations include dependence on high-quality rankings and domain-specific data.

Abstract

Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements and apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's $\tau = 0.64$). It is also consistent with previous findings in that the quality of generated answers improves with the size of the model and more sophisticated prompting strategies.
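
The agreement figure above is a Kendall rank correlation between the ordering induced by the automatic metric and the ordering preferred by the expert. As a minimal sketch of how such a coefficient could be computed, the snippet below implements the tie-free tau-a variant in plain Python. The function name `kendall_tau` and the two score lists are hypothetical illustrations, not the authors' code or data, and the paper may use a different variant (for example one that handles ties).

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two equally long score lists.

    Assumption: neither list contains ties. Tied pairs are simply
    counted as neither concordant nor discordant here.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both lists order items i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical example: five answers to one question.
metric_scores = [0.91, 0.75, 0.60, 0.42, 0.30]  # made-up metric values
expert_scores = [5, 3, 4, 2, 1]                 # made-up preferences, higher is better

print(kendall_tau(metric_scores, expert_scores))
```

In this made-up example, nine of the ten answer pairs are ordered the same way by both lists and one pair is swapped, so the coefficient is (9 - 1) / 10 = 0.8. A mean over questions, as reported in the TL;DR, would average such per-question values.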

Paper Structure (12 sections, 1 equation, 2 figures, 1 table)

Figures (2)

  • Figure 1: NRP of LLM answers, averaged over the ten generated answers per question, grouped by prompt. Points are single answers. GPT-2 L and Llama-2 7B not shown.
  • Figure 2: Number of model parameters vs. NRP using the MultiMedQA prompt.