HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares

TL;DR

HEAD-QA v2 expands a Spanish/English biomedical reasoning benchmark to 12,751 questions drawn from a decade of official exams, enabling evaluation of reasoning in multilingual, domain-specific contexts. The authors benchmark open-source LLMs (Llama 3.1, Mistral, Mixtral) using prompting, retrieval-augmented generation, and log-probability-based answer selection, including a formalization of the option probability $P(A_i)$ in log-space, $\log P(A_i)$, to support short, unambiguous outputs. Performance depends largely on model scale and intrinsic reasoning ability; more complex inference strategies yield limited, inconsistent gains and sometimes reduce accuracy. The work provides practical baselines for biomedical reasoning research, along with multilingual extensions and a public dataset for reproducibility and future study. The paper also contrasts generation-based with probability-based answer selection: scoring options by $\log P(A_i)$ illustrates the role of explicit probabilistic scoring in multiple-choice tasks, but the results favor generation with larger models on HEAD-QA v2.

Abstract

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
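
As a rough illustration of probability-based answer selection, the sketch below scores each option $A_i$ by a log-probability and picks the highest-scoring one. The paper's exact formalization is not reproduced in this summary, so the scoring rule here (summing per-token log-probabilities of each option) is an assumption, and the function names and example probabilities are hypothetical rather than taken from the authors' code or results.

```python
import math
from typing import Dict, List


def sequence_log_prob(token_probs: List[float]) -> float:
    """Assumed scoring rule: log P(A_i) is the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def select_answer(option_token_probs: Dict[str, List[float]]) -> str:
    """Return the option label whose log-probability score is highest."""
    scores = {
        label: sequence_log_prob(probs)
        for label, probs in option_token_probs.items()
    }
    return max(scores, key=scores.get)


# Hypothetical per-token probabilities a model might assign to each option;
# these numbers are made up for illustration only.
example = {
    "A": [0.20, 0.50],
    "B": [0.60, 0.70],
    "C": [0.10, 0.90],
    "D": [0.05, 0.40],
}

best = select_answer(example)
print(best)
```

Working in log-space avoids numerical underflow when multiplying many small token probabilities, and the selected label is by construction one of the valid options, so no free-form output has to be parsed.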

Paper Structure

This paper contains 34 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 2: A HEAD-QA v2 question in JSON format.
  • Figure 3: Question length distribution by year.
  • Figure 5: HEAD-QA v2 question encoded as a single input sequence.
  • Figure 6: Zero-shot prompt. The example, for Llama-3.1, shows the use of headers and special tokens that delimit user–assistant interactions and metadata as specified by the model architecture.
  • Figure 7: Example of a few-shot prompt with samples. Case shown for the Llama-3.1-8B model.
  • ...and 7 more figures