HEAD-QA: A Healthcare Dataset for Complex Reasoning
David Vilares, Carlos Gómez-Rodríguez
TL;DR
HEAD-QA is a domain-specific, multilingual multi-choice QA benchmark built from Spanish healthcare exams and designed to probe complex medical reasoning. The authors evaluate a monolingual (Spanish) setup and a cross-lingual (English) setup with information retrieval baselines and neural readers (DrQA, BiDAF, DGEM, Decompatt), and find a clear gap between machine and human performance. Cross-lingual IR often outperforms monolingual IR, while the neural methods struggle with long, technical questions, which points to a need for better information extraction and reasoning in domain-specific QA. The dataset challenges current architectures and serves as a testbed for multilingual, domain-focused QA research; possible extensions include open-domain settings and a larger dataset.
Abstract
We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams that grant access to specialized positions in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.
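To make the idea of an information retrieval approach to multi-choice QA concrete, the sketch below scores each candidate answer by how well the query "question + option" matches the best passage in a small corpus, and picks the highest-scoring option. It is a minimal illustration under stated assumptions, not the authors' system: the function names (`tokenize`, `build_idf`, `answer_by_retrieval`), the TF-IDF cosine scoring, and the toy passages are all hypothetical and are not taken from HEAD-QA.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_idf(passages):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(passages)
    doc_freq = Counter()
    for passage in passages:
        doc_freq.update(set(tokenize(passage)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_by_retrieval(question, options, passages):
    """Return (index of the best option, list of per-option scores).

    Each option is scored by the best cosine match between the query
    "question + option" and any passage in the corpus.
    """
    if not passages or not options:
        raise ValueError("need at least one passage and one option")
    idf = build_idf(passages)
    passage_vecs = [tfidf(tokenize(p), idf) for p in passages]
    scores = []
    for option in options:
        query = tfidf(tokenize(question + " " + option), idf)
        scores.append(max(cosine(query, pv) for pv in passage_vecs))
    best = max(range(len(options)), key=scores.__getitem__)
    return best, scores


# Toy usage with made-up passages (hypothetical data, not from the dataset).
toy_passages = [
    "The pancreas produces insulin, a hormone that lowers blood glucose.",
    "The liver stores glycogen and secretes bile.",
    "The thyroid gland releases thyroxine.",
]
toy_options = ["Liver", "Pancreas", "Thyroid", "Kidney"]
best, scores = answer_by_retrieval(
    "Which organ produces insulin?", toy_options, toy_passages
)
print(toy_options[best], [round(s, 3) for s in scores])
```

A scorer like this relies purely on lexical overlap, so it has no way to chain facts or handle long clinical vignettes, which is consistent with the summary's point that the benchmark calls for stronger information extraction and reasoning.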
